<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Machine Learning Pivot]]></title><description><![CDATA[A tech blog for data practitioners to learn the software engineering skills you weren't taught.]]></description><link>https://mlpivot.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!u3VK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e829911-587c-4d84-9301-035745d66119_200x200.png</url><title>The Machine Learning Pivot</title><link>https://mlpivot.dev</link></image><generator>Substack</generator><lastBuildDate>Sat, 18 Apr 2026 04:37:57 GMT</lastBuildDate><atom:link href="https://mlpivot.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Machine Learning Pivot]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mlpivot@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[mlpivot@substack.com]]></itunes:email><itunes:name><![CDATA[The Machine Learning Pivot]]></itunes:name></itunes:owner><itunes:author><![CDATA[The Machine Learning Pivot]]></itunes:author><googleplay:owner><![CDATA[mlpivot@substack.com]]></googleplay:owner><googleplay:email><![CDATA[mlpivot@substack.com]]></googleplay:email><googleplay:author><![CDATA[The Machine Learning Pivot]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Building a Production Ready Text-to-Speech API]]></title><description><![CDATA[Set yourself apart from other MLEs by learning how to work with audio and serve Text-to-Speech models.]]></description><link>https://mlpivot.dev/p/building-a-production-ready-text</link><guid isPermaLink="false">https://mlpivot.dev/p/building-a-production-ready-text</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Fri, 19 Dec 2025 20:28:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jnT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been obsessed with text-to-speech systems lately. Every time I hear a screen reader or an audiobook with synthesized narration, I can&#8217;t help but think about the engineering that went into building that TTS system.</p><p>I built my first TTS serving platform during a company hackathon. The requirements seemed straightforward: accept text input, return audio in various formats, and let admins add voice samples for cloning (where the model learns to mimic someone&#8217;s voice from a short audio clip).</p><p>But building a framework that actually handles production traffic taught me that audio APIs need different patterns than text-based systems.</p><p>Working with audio changes your API design because you&#8217;re dealing with binary data instead of text. File sizes directly affect your bandwidth costs. Users notice delays with audio differently than with text. A 500ms delay for text generation feels fine, but the same delay before audio plays feels broken. You&#8217;ll need to think about streaming, caching, and format conversions in ways you probably haven&#8217;t if you&#8217;re coming from text APIs.</p><p>This post walks through building a complete TTS serving system from scratch. I&#8217;ll cover audio encoding, HTTP streaming for progressive playback, and the engineering patterns for serving compute-intensive models. These patterns apply beyond TTS. I&#8217;ve used similar approaches for image generation APIs.</p><h2>What You Should Know Before Starting</h2><p>You should be comfortable with Python, REST APIs, and making HTTP requests. Basic knowledge of neural networks helps but isn&#8217;t required. I&#8217;ll explain the audio fundamentals and binary data handling you need as we go. I promise it&#8217;s not as intimidating as it seems &#128521;.</p><h2>Why TTS Feels Different Than Text Generation</h2><p>I&#8217;ve worked with both LLMs and TTS models over the past year, and they feel completely different to work with. With LLMs, you generate discrete tokens that you can count and measure with metrics. The output is text - you can print it, analyze it with regex, or store it in a database easily.</p><p>TTS generates continuous audio waveforms. You can&#8217;t just run BLEU scores and call it done. The output is binary data that needs special handling.</p><p>Quality is subjective too, which makes evaluation tricky. What sounds natural to me might sound robotic to you &#127908;&#129302;. </p><p>File size becomes a concern too. Ten seconds of WAV audio is 1.6MB. The same audio as MP3 might be only 80KB. Your choice of audio format directly impacts bandwidth costs and user experience.</p><h2>Audio Fundamentals &#128265;</h2><p>Before we build anything, you need to understand how audio works. I know it sounds dry, but trust me, understanding these basics will save you hours of debugging weird audio issues later.</p><h3>Sample Rate</h3><p>Sample rate measures how many samples per second your audio contains. Think of it like the frame rate of a video - more samples mean more detail, but also larger files.</p><p>Common sample rates you&#8217;ll encounter:</p><ul><li><p><strong>44.1kHz</strong> (CD quality) - Standard for music, but overkill for speech</p></li><li><p><strong>22.05kHz</strong> (good for speech) - Sweet spot for TTS quality vs file size</p></li><li><p><strong>16kHz</strong> (acceptable for speech) - Minimum for decent speech quality</p></li><li><p><strong>8kHz</strong> (phone quality) - You&#8217;ll recognize this as &#8220;that phone voice quality&#8221;</p></li></ul><p>The Nyquist theorem says your sample rate must be at least twice the highest frequency you want to capture. Human speech tops out around 8kHz, which is why 16kHz sampling captures everything you need for speech. Music goes much higher (up to 20kHz for human hearing), which is why music uses 44.1kHz.</p><p>For TTS specifically, I recommend 22.05kHz. Going 44.1kHz doesn&#8217;t improve quality because speech just doesn&#8217;t have those higher frequencies. You&#8217;re literally wasting storage and bandwidth.</p><h3>Bit Depth</h3><p>Bit depth determines how many bits represent each sample. This is like color depth in images - more bits mean more precision.</p><p>Common bit depths:</p><ul><li><p><strong>8-bit</strong> - Sound terrible, only use for prototyping</p></li><li><p><strong>16-bit</strong> - Standard for speech, gives you 96dB of dynamic range</p></li><li><p><strong>24-bit</strong> - Used for professional audio recording, overkill for TTS</p></li></ul><p>16-bit is standard for speech. You get 96dB of dynamic range, which is way more than enough for human speech (normal conversation is around 60dB). Going to 24-bit doesn&#8217;t help TTS at all.</p><h3>Channels</h3><p>Audio can be mono (one channel) or stereo (two channels). TTS is almost always mono because speech doesn&#8217;t need stereo imaging. Think about it - when someone talks to you, you don&#8217;t need them in surround sound.</p><p>Stereo doubles your file size for zero benefit in TTS applications. The only exception is if you&#8217;re doing something creative like placing different speakers in different channels, but that&#8217;s pretty niche.</p><h3>Audio Formats</h3><p>This is where things get practical. You need to understand these formats because you&#8217;ll use different ones at different stages of your pipeline.</p><h4>WAV (Uncompressed)</h4><p>WAV stores raw PCM (pulse-code modulation) data with RIFF header. It&#8217;s universal, maintains perfect quality, but the files are massive! One minute of 16-bit, 44.1kHz mono audio is 5.3MB. For a typical audiobook chapter (20 minutes), that&#8217;s over 100MB uncompressed.</p><p>I use WAV for:</p><ul><li><p>Intermediate processing in my pipeline</p></li><li><p>Debugging audio quality issues</p></li><li><p>Archival storage when quality matters most</p></li></ul><p>When you&#8217;re manipulating audio (resampling, normalizing, etc.), always work with WAV. The last thing you want is to lose quality at each processing step through multiple lossy compressions.</p><h4>MP3 (Lossy Compression)</h4><p>MP3 uses &#8220;lossy&#8221; compression, meaning it permanently throws away data to shrink files. The codec (the algorithm that encodes and decodes audio) uses psychoacoustic models to figure out what to remove. It analyzes which frequencies humans can&#8217;t hear or won&#8217;t notice because they&#8217;re masked by other sounds, then discards them. Pretty clever.</p><p>MP3 compression is dramatic. A one-minute audio clip that&#8217;s 5MB as WAV becomes 500KB as MP3. When you&#8217;re serving thousands of audio clips, that 10x reduction in file size directly impacts your bandwidth costs.</p><p>The tradeoff is quality degradation. Each encode/decode cycle loses some information. This is why you should never edit an MP3 directly - always decompress to WAV first, edit, then re-encode. But for delivering audio to user over the internet, MP3 works well.</p><h4>Opus (Modern Lossy)</h4><p>Opus is the newer codec and it&#8217;s legitimately better than MP3 at the same bitrate. It&#8217;s a hybrid codec optimized for both speech and music. Lower latency (important for real-time), better quality at low bitrates, and it&#8217;s an open standard (no licensing fees).</p><p>The catch? Slightly less universal browser support than MP3. Safari only added full support relatively recently. For WebRTC or real-time streaming applications, Opus is excellent. For broader compatibility across all browsers and devices, MP3 still wins.</p><p>My rule of thumb: Use MP3 for maximum compatibility, use Opus if you&#8217;re targeting modern browsers or need real-time capabilities.</p><h4>Working With Audio in Python</h4><p>Here&#8217;s how you load and manipulate audio in Python. I&#8217;ll walk through each line.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QMso!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QMso!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 424w, https://substackcdn.com/image/fetch/$s_!QMso!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 848w, https://substackcdn.com/image/fetch/$s_!QMso!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 1272w, https://substackcdn.com/image/fetch/$s_!QMso!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QMso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png" width="1380" height="856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:856,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:197397,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QMso!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 424w, https://substackcdn.com/image/fetch/$s_!QMso!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 848w, https://substackcdn.com/image/fetch/$s_!QMso!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 1272w, https://substackcdn.com/image/fetch/$s_!QMso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c8f3c9b-b7ba-4d74-a3f5-f5dbe55314bd_1380x856.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Audio is just a Numpy array! Each element is a sample value (usually a float between -1.0 and 1.0), and the sample rate determines how those samples map to time. It&#8217;s actually pretty straightforward once you realize it&#8217;s just an array of numbers.</p><p>The <code>sr=None</code> parameter in <code>librosa.load()</code> is important. It preserves the original sample rate. If you omit it, librosa defaults to 22050Hz and resamples your audio automatically, which might not be what you want.</p><h2>Choosing Your TTS Model</h2><p>Model selection is crucial and involves trade-offs between quality, speed, voice variety, and infrastructure requirements. High-quality models are generally slower. More voices mean larger model downloads. Voice cloning adds significant complexity.</p><p>I&#8217;ve tested most of the popular open-source options. Here&#8217;s my assessment of each:</p><p><strong>Coqui XTTS v2:</strong> This is what I&#8217;m using in production. Multi-lingual support (17+ languages), supports voice cloning, model size is around 2GB. You need at least 4GB+ VRAM. The quality is solid, not quite ElevenLabs level, but close enough for most applications. Generation speed is about 1-2 seconds for a typical sentence on a T4 GPU.</p><p><strong>Bark:</strong> Highly expressive and can generate music and sound effects, which is pretty cool. But it&#8217;s significantly slower than XTTS (sometimes 5-10 seconds or a single sentence). Great for creative applications where you need those extra sound effects, but not practical for real-time or high-throughput scenarios.</p><p><strong>Piper:</strong> Fast CPU inference, multiple voices out of the box, but limited customization options. This is excellent if you don&#8217;t have GPU infrastructure or need something lightweight. Quality is decent but not amazing. I use this for dev environments where I don&#8217;t want to spin up GPU instances.</p><p><strong>StyleTTS2:</strong> High quality with voice cloning capability, moderate speed. It requires more ML knowledge to tune properly though. If you&#8217;re comfortable with model training and fine-tuning, this gives you more control. But the learning curve is steeper.</p><p>For this tutorial, I&#8217;m using Coqui XTTS v2. It hits the sweet spot of quality, speed, and voice cloning capability. The model needs 4GB+ VRAM and supports voice cloning with just 6-30 second samples (which is impressively short). XTTS is licensed under Mozilla Public License 2.0, so you can use it commercially without licensing headaches.</p><h2>Our Project Structure</h2><p>Here&#8217;s how I organize TTS projects. Don&#8217;t come for me &#129763;.</p><pre><code>tts-api/
&#9500;&#9472;&#9472; server/
&#9474;   &#9500;&#9472;&#9472; __init__.py
&#9474;   &#9500;&#9472;&#9472; main.py          # FastAPI application
&#9474;   &#9500;&#9472;&#9472; model.py         # TTS model loading and inference
&#9474;   &#9500;&#9472;&#9472; audio.py         # Audio processing utilities
&#9474;   &#9500;&#9472;&#9472; storage.py       # File storage management
&#9474;   &#9500;&#9472;&#9472; schemas.py       # Pydantic validation models
&#9474;   &#9492;&#9472;&#9472; config.py        # Configuration
&#9500;&#9472;&#9472; client/
&#9474;   &#9492;&#9472;&#9472; client.py        # Python client library
&#9500;&#9472;&#9472; storage/
&#9474;   &#9500;&#9472;&#9472; voices/          # Voice sample storage
&#9474;   &#9492;&#9472;&#9472; generated/       # Generated audio cache
&#9500;&#9472;&#9472; tests/
&#9474;   &#9500;&#9472;&#9472; test_audio.py
&#9474;   &#9500;&#9472;&#9472; test_api.py
&#9474;   &#9492;&#9472;&#9472; test_integration.py
&#9500;&#9472;&#9472; requirements.txt
&#9500;&#9472;&#9472; .env
&#9492;&#9472;&#9472; README.md</code></pre><p>This structure separates concerns cleanly. Audio processing lives in<code> audio.py</code>, model inference in <code>model.py</code>, storage management in <code>storage.py</code>. FastAPI handles HTTP in <code>main.py</code>, Pydantic validates inputs in <code>schemas.py,</code> and the client library in <code>client/</code> provides a clean interface for users.</p><h2>Building Audio Processing Utilities</h2><p>Audio processing needs careful handling. I centralize these operations to prevent bugs from inconsistent processing across different parts of the codebase.</p><h3>Normalization: Making Audio Levels Consistent</h3><p>User-uploaded voice samples have wildly different volumes. Some people record super quiet, others are basically YELLING into their microphone. Training on inconsistent volumes confuses the model, and generation produces unpredictable loudness that makes your API feel buggy.</p><p>Here&#8217;s the normalization function with detailed explanation &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbFk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pbFk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 424w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 848w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 1272w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pbFk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png" width="1456" height="1640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:477310,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pbFk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 424w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 848w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 1272w, https://substackcdn.com/image/fetch/$s_!pbFk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e6bfce-cf97-4334-afe6-1f2d4b2cd856_1852x2086.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This ensures consistent loudness across all your audio. The math converts between linear amplitude (what the actual samples represent) and logarithmic decibels (how humans perceive loudness). Standard practice for speech is -20dB LUFS (Loudness Units relative to Full Scale).</p><p>Why -20dB specifically? It&#8217;s loud enough to be clearly heard, but leaves enough headroom that you won&#8217;t clip (distort) when processing. </p><h3>Resample: Getting the Right Sample Rate</h3><p>Models expect specific sample rates, typically 22.05kHz or 24kHz for XTTS. User uploads might be 44.1kHz (CD quality), 48kHz (video standard), or even 8kHz (phone quality). Mismatched sample rates cause pitch and speed issues.</p><p>Trust me, you don&#8217;t want your API accidentally generating chipmunk voices because someone uploaded 44.1kHz audio and you didn&#8217;t resample &#128063;&#65039;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jea4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jea4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jea4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png" width="1456" height="1110" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1110,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jea4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 424w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 848w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!Jea4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a99a054-ca41-4b1c-b5d0-1a21f0e3d77b_1514x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The implementation looks simple, but librosa does the heavy lifting behind the scenes. It uses polyphase filtering, which is a sophisticated resampling algorithm that avoids aliasing. </p><blockquote><p>Aliasing is the audio distortion you get from poor resampling (think of it like jagged edges in a low-resolution image, but for sound). Librosa handles this complexity so you don&#8217;t have to worry about it.</p></blockquote><h3>Format Conversion: Moving Between Formats</h3><p>Internal processing uses WAV for quality. User delivery uses MP3 for size. Different clients might prefer different formats based on their use case.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DQTD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DQTD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 424w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 848w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DQTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png" width="1446" height="1526" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1526,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309281,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DQTD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 424w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 848w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 1272w, https://substackcdn.com/image/fetch/$s_!DQTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314a87b6-9884-458a-a2c2-c0c774d571ba_1446x1526.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <code>-q:a 2</code> parameter for MP3 sets the quality level (0-9, lower is better). I use 2 which provides excellent quality while keeping file sizes reasonable. This is a good balance for most TTS applications.</p><h2>Building the Model Server</h2><p>Now for the core of our system, the model server that actually generates speech.</p><h3>Why Use a Singleton Pattern?</h3><p>The TTS model is 2GB+ in memory. Loading takes 10-20 seconds (mostly downloading model weights on the first run). Loading multiple copies would quickly exhaust GPU memory. One model instance can efficiently serve all requests.</p><p>Here&#8217;s the singleton implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5RT9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5RT9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 424w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 848w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 1272w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5RT9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png" width="1456" height="2713" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2713,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:639105,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5RT9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 424w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 848w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 1272w, https://substackcdn.com/image/fetch/$s_!5RT9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4079518a-2a9a-49f5-a5cf-0dedbb817d9f_1598x2978.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The singleton pattern uses Python&#8217;s <code>__new__</code> method to control instance creation. Every time you call <code>TTSModelServer()</code>, you get the same instance. The model is loaded once on first access and reused for all subsequent requests.</p><p>This is a classic creational pattern and works perfectly for resource-heavy models. Plus, the logging helps you debug issues; you&#8217;ll know immediately if the model failed to load or if you&#8217;re accidentally trying to use it before loading.</p><h3>How XTTS Actually Works - A Brief Technical Deep-Dive</h3><p>I think it&#8217;s worth understanding what&#8217;s happening under the hood, even if you&#8217;re not planning to modify the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jnT-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jnT-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jnT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1635586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jnT-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!jnT-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4915b33-1a8d-4079-a4a7-0afb4623bb4b_1024x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>XTTS uses a Transformer-based architecture with a GPT-style autoregressive decoder. The generation process flows like this:</p><ol><li><p><strong>Text &#8594; Phonemes</strong>: Text is converted to phonetic representation (like &#8220;hello&#8221; becomes &#8220;h eh l ow&#8221;)</p></li><li><p><strong>Phonemes &#8594; Acoustic Features</strong>: The transformer predicts mel spectrogram frames. A mel spectrogram is a visual representation of sound that shows which frequencies are present over time. Imagine a heat map showing which frequencies appear at what times. It&#8217;s basically the blueprint the model uses to understand what the speech should sound like.</p></li><li><p><strong>Acoustic Features &#8594; Audio</strong>: A vocoder (the algorithm that synthesizes the actual sound waves) converts mel spectrograms back into audio you can hear. The vocoder XTTS uses is called HiFi-GAN, which is good at producing natural-sounding speech.</p></li></ol><p>Voice cloning works by encoding a 6-30 second sample into a speaker embedding vector (basically a numerical fingerprint of that voice). Generation then conditions on both the text input and this speaker embedding. The temperature parameter controls randomness in token sampling, just like with GPT models (higher temperature = more variation).</p><p>This architecture is why XTTS can clone voices with such short samples compared to older methods that needed hours of training data!</p><h3>Generating Speech</h3><p>Here&#8217;s the core generation function with comprehensive error handling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qN5X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qN5X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 424w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 848w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 1272w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qN5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png" width="1456" height="2564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:723527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qN5X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 424w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 848w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 1272w, https://substackcdn.com/image/fetch/$s_!qN5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd47e70be-4ee7-4dd5-9ccc-9e625269d9eb_1818x3202.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The RTF (Real-Time Factor) metric I log is useful. It tells you if your system can keep up with real-time audio. RTF of 1.0 means it takes 1 second to generate 1 second of audio. RTF of 0.5 means you&#8217;re generating 2x faster than real-time (good!). RTF of 2.0 means you&#8217;re generating 2x slower than real-time (might be a problem for real-time applications).</p><h3>Voice Cloning Implementation</h3><p>Voice cloning is where XTTS really shines. You can clone any voice with just a short sample.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K5H1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K5H1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 424w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 848w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 1272w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K5H1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png" width="1456" height="3046" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3046,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:762488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K5H1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 424w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 848w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 1272w, https://substackcdn.com/image/fetch/$s_!K5H1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d4c843-5a5b-4169-9eee-e9706904011b_1566x3276.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><strong>&#9888;&#65039; Important:</strong> Voice samples must be 6-30 seconds of clear speech with minimal background noise. One speaker only. If you have multiple speakers talking in the sample, the model will get confused and generate weird blended voices that sound like neither person. I learned this by accidentally using a podcast clip with two hosts talking over each other. The result was&#8230;interesting &#128517;.</p></blockquote><p>Also, background music or noise degrades quality significantly. I recommend recording in a quiet room or using a sample from a professional podcast with good audio quality.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Storage Management and Caching Strategy</h2><p>You need to properly organize storage for voice samples and generated audio. File paths, metadata, and cleanup all need careful management. More importantly, caching frequently requested generations can save you tons of compute time and money.</p><h3>Why Caching Matters So Much</h3><p>Caching matters more than you might expect for TTS APIs. After analyzing production logs, about 40% of requests were for the same 20 phrases. Things like &#8220;Welcome to our service&#8221;, &#8220;Your order has shipped&#8221;, common notification messages that apps send repeatedly.</p><p>Each generation takes 1-2 seconds of GPU time. Cache lookup? 10-20ms. That difference adds up fast. Implementing caching dropped GPU utilization by 30%, which meant handling more concurrent users on the same hardware.</p><p>Here&#8217;s the caching implementation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eHwL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eHwL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 424w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 848w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 1272w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eHwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png" width="1456" height="5992" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:5992,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1898223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eHwL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 424w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 848w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 1272w, https://substackcdn.com/image/fetch/$s_!eHwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50acfab-c5bf-444b-91c1-e9f789a2acf8_1936x7968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This caching system saved me a fortune in compute costs. The LRU (Least Recently Used) eviction strategy means popular phrases stay cached while rarely-used ones get cleaned up automatically.</p><p><strong>Pop Quiz:</strong> Why use content-based keys (hashing the input) instead of just incrementing counters? &#129300;</p><p><strong>Answer:</strong> Content-based keys are deterministic! The same input always produces the same key, so we can look up cache hits without maintaining a separate mapping. Plus, it naturally handles duplicate requests without complex deduplication logic.</p><h2>Validating Requests with Pydantic</h2><p>Pydantic is amazing for API input validation. It catches bad inputs at the API boundary before they hit your expensive model operations or corrupt your storage layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mlyP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mlyP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 424w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 848w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 1272w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mlyP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png" width="1456" height="3812" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3812,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:887494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mlyP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 424w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 848w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 1272w, https://substackcdn.com/image/fetch/$s_!mlyP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F305a5117-158e-49f3-b146-1035a8887b08_1650x4320.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This validation layer has saved me countless debugging hours. Users sometimes send the wildest inputs. Empty strings, 50,000 character novels, random binary data misidentified as text. Pydantic catches all of it before it hits my model.</p><h2>Building the FastAPI Server</h2><p>Now for the exciting part, actually serving our TTS model over HTTP! This is where everything comes together.</p><h3>Streaming Audio Responses</h3><p>Now we&#8217;re cooking &#128293;. Audio files can be several megabytes, and users can hear audio while it&#8217;s still downloading (progressive playback). This dramatically lowers perceived latency and improves the user experience.</p><p>Here&#8217;s the implementation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FMU2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FMU2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 424w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 848w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 1272w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FMU2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png" width="1456" height="5259" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:5259,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1293973,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FMU2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 424w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 848w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 1272w, https://substackcdn.com/image/fetch/$s_!FMU2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F149ca351-14a8-49aa-8c52-b09fdb3e70b0_1598x5772.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The browser receives the first chunk and begins playback immediately. Audio codecs can decode partial files, so progressive download enables playback during transfer. This is absolutely essential for good UX on slow connections.</p><h3>Why Async Matters Here</h3><p>Here&#8217;s something I wish someone had explained to me when I first learned async programming:</p><pre><code># Bad: Blocks event loop during I/O
with open(path, &#8216;rb&#8217;) as f:
    data = f.read()  # Blocks entire server - other requests wait!

# Good: Allows other requests to process
async with aiofiles.open(path, &#8216;rb&#8217;) as f:
    data = await f.read()  # Other requests handled during I/O</code></pre><p>With blocking I/O, your entire server pauses while reading the file. If the file read takes 100ms, no other requests can be processed during that time. With async I/O, Python&#8217;s event loop can handle other requests while waiting for the disk.</p><p>This made a massive difference in my server&#8217;s throughput. Before async, I could handle maybe 10 concurrent requests. After implementing async properly, I could handle 50+ concurrent requests on the same hardware.</p><h3>Voice Upload Endpoint</h3><p>Users need a way to upload their voice samples for cloning. Here&#8217;s how I handle that with proper validation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3fTl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3fTl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 424w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 848w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 1272w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3fTl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png" width="1456" height="2928" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2928,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:895214,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3fTl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 424w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 848w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 1272w, https://substackcdn.com/image/fetch/$s_!3fTl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf0f8ef6-4f24-4aa4-bb5b-e9d24ad6927f_1852x3724.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The finally block is critical here. Even if something fails during processing, we clean up the temporary file. Otherwise, failed uploads would fill up your disk over time. Ask me how I know &#128527;.</p><h2>Building the Python Client</h2><p>The client library provides a clean interface for users. It handles all the HTTP details, streaming, retry logic, and provides a Pythonic API that feels natural.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!POYw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!POYw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 424w, https://substackcdn.com/image/fetch/$s_!POYw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 848w, https://substackcdn.com/image/fetch/$s_!POYw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 1272w, https://substackcdn.com/image/fetch/$s_!POYw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!POYw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png" width="1456" height="6642" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:6642,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1919531,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!POYw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 424w, https://substackcdn.com/image/fetch/$s_!POYw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 848w, https://substackcdn.com/image/fetch/$s_!POYw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 1272w, https://substackcdn.com/image/fetch/$s_!POYw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517be02d-de9d-4483-8e8f-b15eae49ca1b_1902x8676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The retry logic with exponential backoff smooths out transient errors. Sometimes servers get temporarily overwhelmed. Instead of failing immediately, the client automatically retries with increasing delays.</p><h2>Testing Your System</h2><p>Testing audio systems is different from testing text APIs. String comparison doesn&#8217;t work when you&#8217;re dealing with binary audio files. You need to verify formats, check durations, validate file sizes, and sometimes manually listen to outputs. Here&#8217;s how I structure tests. </p><h3>Unit Tests for Audio Processing</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dOA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dOA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 424w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 848w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 1272w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png" width="1456" height="1850" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:509439,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dOA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 424w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 848w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 1272w, https://substackcdn.com/image/fetch/$s_!7dOA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ff8b5cf-2db8-4665-a04b-736ac6afb214_1582x2010.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Integration Tests</h3><p>Integration tests verify the whole system works together.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pyq8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pyq8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 424w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 848w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 1272w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pyq8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png" width="1456" height="3877" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3877,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1021957,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/180145768?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pyq8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 424w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 848w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 1272w, https://substackcdn.com/image/fetch/$s_!pyq8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3756820-2a2e-41db-b58d-b3938fd64e08_1650x4394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>What We&#8217;ve Built</h2><p>Let&#8217;s take a step back and bask in the glory of our beautiful TTS creation!</p><p>We have:</p><ul><li><p><strong>Model server with singleton pattern</strong> - Efficient resource usage, handles model lifecycle</p></li><li><p><strong>Audio processing pipeline</strong> - Normalization, resampling, format conversion with quality preservation</p></li><li><p><strong>File storage and caching</strong> - Content-based caching with LRU eviction, saving compute costs</p></li><li><p><strong>Voice sample management</strong> - Upload, validation, preprocessing for voice cloning</p></li><li><p><strong>RESTful API with streaming</strong> - Progressive playback, proper error handling, async I/O</p></li><li><p><strong>Python client library</strong> - Clean API, automatic retries, comprehensive error messages</p></li><li><p><strong>Comprehensive testing</strong> - Unit tests, integration tests, load tests</p></li></ul><p>The fundamental lessons were efficient resource management, smart caching, streaming delivery, and clean separation of concerns.</p><h2>Next Steps</h2><p>In Part 2, I&#8217;ll cover production deployment.</p><p>We&#8217;ll tackle:</p><ul><li><p><strong>Dockerization</strong>: Creating efficient Docker images, multi-stage builds for smaller sizes</p></li><li><p><strong>Orchestration</strong>: Kubernetes deployment, scaling strategies, load balancing</p></li><li><p><strong>CDN integration</strong>: Serving cached audio from edge locations for better global performance</p></li><li><p><strong>Cost optimization</strong>: Reducing GPU costs, efficient batching, spot instances</p></li><li><p><strong>Monitoring and alerting</strong>: Metrics that matter, detecting quality degradation, performance tracking</p></li><li><p><strong>Scaling considerations</strong>: Horizontal vs vertical scaling, when to scale up</p></li></ul><p>For now, I encourage you to run this code locally and experiment!</p><h3>Homework</h3><p>I recommend you:</p><ol><li><p><strong>Upload your own voice</strong> - Try cloning your voice with a 15-second sample</p></li><li><p><strong>Measure cache performance</strong> - Generate the same phrase 10 times, compare first vs subsequent requests</p></li><li><p><strong>Test different audio formats</strong> - Compare MP3 at 128k vs 192k, see if you can hear the difference</p></li><li><p><strong>Experiment with text lengths</strong> - How does generation time scale with text length?</p></li><li><p><strong>Monitor GPU utilization</strong> - Run <code>nvidia-smi</code> while generating, watch memory and utilization</p></li><li><p><strong>Try different voices</strong> - Upload multiple voice samples, see how well cloning works</p></li></ol><p>You can also extend this system in many directions.</p><ul><li><p><strong>Emotion control</strong>: Add parameters for controlling emotional tone of speech</p></li><li><p><strong>SSML support</strong>: Implement Speech Synthesis Markup Language for prosody control</p></li><li><p><strong>Multi-speaker conversations</strong>: Generate dialogues with different voices</p></li><li><p><strong>Real-time streaming</strong>: Build WebSocket endpoint for real-time generation as you type</p></li><li><p><strong>Audio effects pipeline</strong>: Add reverb, echo, pitch shifting for creative applications</p></li><li><p><strong>Batch processing</strong>: API endpoint that handles hundreds of texts at once efficiently</p></li></ul><p>I could probably create branching posts that go into each of these different avenues.</p><p>Regardless, the patterns here transfer to whatever you build next. Whether you&#8217;re serving image generation models, video processing pipelines, or any compute-intensive API, you&#8217;ll use these same principles: caching for cost savings, streaming for better UX, singleton pattern for resource management, comprehensive testing, and async I/O for scalability.</p><p>I hope this guide gave you both the technical knowledge and practical insights to build your own production TTS system! Feel free to reach me on LinkedIn or at cynthia@cynscode.com.</p><p>Happy coding &#128077;&#127998;</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a Vision API with Magma8B]]></title><description><![CDATA[Vision models describe what's in an image, but they can't handle spatial references. Point at an object and ask "What color is this car?" and the model doesn't know what you're talking about. In this post we'll learn about Set-of-Mark prompting and how vision models can see what you're seeing &#128064;.]]></description><link>https://mlpivot.dev/p/building-a-vision-api-with-magma8b</link><guid isPermaLink="false">https://mlpivot.dev/p/building-a-vision-api-with-magma8b</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Sat, 06 Dec 2025 18:58:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!90V6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Vision-language models have gotten really powerful lately, but there&#8217;s a fundamental problem when you try to use them for spatial reasoning. Upload an image with five cars and ask &#8220;What color is that car?&#8221; The model has no clue which one you&#8217;re talking about. It might describe all of them, pick randomly, or get confused and give you vague nonsense.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!90V6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!90V6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!90V6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!90V6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!90V6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!90V6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1618641,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!90V6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!90V6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!90V6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!90V6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa572527-55f0-436f-810f-515c0d85d1a9_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You could try being more descriptive: &#8220;the red car on the left side near the building.&#8221; Now the model has to understand relative positions, colors, spatial relationships, and landmarks all at once. That&#8217;s asking a lot, and the results are hit or miss.</p><p>Microsoft&#8217;s Magma8B solves this with Set-of-Mark (SoM) prompting. You draw numbered boxes directly on the image. Instead of vague spatial descriptions, you mark the car as #1 and ask &#8220;What color is object #1?&#8221; The model understands these visual markers because it was specifically trained on them. No ambiguity, no confusion.</p><p>I&#8217;m walking you through building a complete client-server API using Magma8B. You&#8217;ll learn how to serve an 8-billion parameter vision model, handle image encoding over HTTP, implement proper error handling, and structure code that doesn&#8217;t fall apart when things go wrong. The software engineering patterns here apply whether you&#8217;re serving ML models, building microservices, or designing any distributed system.</p><p>This post will separate you from the other machine learning engineers or data scientists who can only get a model working in a Jupyter notebook. By the end of this article, you&#8217;ll walk away knowing how to build a system that handles real requests.</p><h2>What You Need to Know</h2><p>You should be comfortable with Python classes and functions. REST APIs should make sense at a basic level. If you&#8217;ve used <code>requests.get()</code> and understand what HTTP status codes mean, you&#8217;re fine. Some understanding of neural networks helps, but you don&#8217;t need to derive backpropagation or understand gradient descent in depth.</p><p>Docker knowledge isn&#8217;t required for this part. We&#8217;ll cover containerization and deployment in Part 2.</p><h2>Understanding Set-of-Mark Prompting</h2><p>Traditional visual question answering (VQA) systems struggle with spatial references. They can tell you what&#8217;s in an image, but when you need to point at something specific and ask about <em>that exact thing</em>, they don&#8217;t have a reliable mechanism to understand your reference.</p><p>Set-of-Mark prompting changes this by making spatial references explicit. Here&#8217;s how it works:</p><ol><li><p>Add visual markers to your image (numbered bounding boxes, highlights, or tags)</p></li><li><p>The model processes both the marked image and your text prompt</p></li><li><p>Reference specific markers in your text: &#8220;What is object #3 doing?&#8221;</p></li><li><p>The model grounds its response in the marked regions</p></li></ol><p>The model learned to associate text references like &#8220;#3&#8221; with visual regions during training. When you ask about &#8220;#3&#8221;, the model knows exactly which part of the image you mean.</p><p>This isn&#8217;t just a clever prompt engineering trick. Magma8B was trained on millions of images with Set-of-Mark annotations. The model&#8217;s weights encode the relationship between numeric references in text and visual markers in images. You can&#8217;t just take any vision-language model and expect it to understand marked images.</p><h3>Real-World Applications</h3><p>Set-of-Mark prompting unlocks some practical use cases.</p><p><strong>Document Analysis:</strong> When working with scanned documents, you can mark specific paragraphs or sections. &#8220;Summarize paragraph #3&#8221; becomes unambiguous. Legal document review, contract analysis, and research paper summarization all benefit from this precision.</p><p><strong>UI Testing:</strong> Automated testing frameworks can mark specific UI elements. &#8220;Click button #2&#8221; gives you maintainable tests that don&#8217;t break when the UI layout changes slightly.</p><p><strong>Robotics:</strong> Pick-and-place tasks become more reliable. &#8220;Grasp object #1&#8221; tells the robot exactly which object to target, even in cluttered environments.</p><p><strong>Medical Imaging:</strong> Radiologists can mark regions of interest and ask specific questions. &#8220;Analyze the tissue marked as #1&#8221; focuses the model&#8217;s attention on the relevant area.</p><p>The key advantage is precision. Spatial references are explicit rather than implicit, reducing errors and making the system&#8217;s behavior predictable.</p><h2>Understanding Magma8B &#129302;</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wAzN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wAzN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wAzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12254938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wAzN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 424w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 848w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!wAzN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e4ba2db-405b-48cd-8b4c-cde560840b61_5776x3851.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Magma8B is Microsoft&#8217;s vision-language-action model released in February 2025. It has 8 billion parameters and combines a CLIP-ConvNeXt-XXLarge vision encoder with Meta&#8217;s Llama-3 language model.</p><p>What makes Magma significant is that it&#8217;s the <strong>first</strong> foundation model designed specifically for multimodal AI agents!</p><p>Most vision-language models stop at understanding. They describe images or answer questions about them. Magma goes further. It generates goal-driven visual plans and actions. The same model that answers &#8216;What color is object #1?&#8217; can tell a browser to click the search button or generate gripper coordinates for a robot arm.</p><p>In Microsoft&#8217;s benchmarks, Magma is the only model that performs across the full task spectrum: visual QA, UI navigation, and robotics manipulation. GPT-4V can&#8217;t control a robot arm. OpenVLA can&#8217;t answer questions about charts. Magma does both.</p><p>The architecture has three main components:</p><p><strong>Vision Encoder</strong>: CLIP-ConvNeXt-XXLarge, trained by the LAION team on billions of image-text pairs. It processes images into visual tokens that capture colors, shapes, textures, spatial relationships, and the visual markers you add.</p><p><strong>Language Model</strong>: Meta&#8217;s Llama-3-8B-Instruct serves as the backbone. It processes both text tokens and visual tokens together. This is where reasoning happens.</p><p><strong>Training Data</strong>: Microsoft trained Magma on a mix of generic image and video data (LLaVA-Next, ShareGPT4Video), instructional videos (Ego4D, Epic-Kitchen), robotics manipulation data (Open-X-Embodiment), and UI grounding data (SeeClick). The diversity matters; it&#8217;s why the model generalizes across such different tasks.</p><p>Set-of-Mark isn&#8217;t something you can apply to just any vision model. It&#8217;s specific to how Magma was trained. Microsoft introduced two techniques during training:</p><p><strong>Set-of-Mark (SoM):</strong> For static images, objects are labeled with visual markers. The model learned to ground its responses to specific marked regions.</p><p><strong>Trace-of-Mark (ToM):</strong> For videos, object movements are tracked over time. This helps with action planning and understanding temporal dynamics.</p><p>We&#8217;re focusing on Set-of-Mark for static images. Trace-of-Mark is powerful for video and robotics data, but that&#8217;s beyond scope here.</p><blockquote><p>&#9888;&#65039;<strong> Important Research Context: </strong>Magma8B is explicitly designed for research purposes. Microsoft&#8217;s model card states it should not be deployed in production settings or used in high-risk scenarios like military, financial services, or critical infrastructure.</p></blockquote><p>This doesn&#8217;t mean you can&#8217;t learn from it or build prototypes. It means you need to understand its limitations:</p><ul><li><p>The model was trained primarily on English text</p></li><li><p>It can perpetuate stereotypes or produce inappropriate content</p></li><li><p>It has common multimodal model limitations around hallucination</p></li><li><p>It hasn&#8217;t been extensively safety-tested for production use</p></li></ul><p>For learning, experimentation, and non-critical applications, Magma8B is excellent. For production systems handling sensitive data or critical decisions, you&#8217;d want a model with more extensive safety testing and production support.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Why Client-Server Architecture Matters</h2><p>Running this model directly in your application isn&#8217;t practical for several reasons.</p><p>The model checkpoint is 16GB. You&#8217;re not shipping that to mobile devices or embedding it in web applications. Inference needs a GPU with at least 24GB of VRAM. Most consumer laptops don&#8217;t have that. Loading the model takes several seconds on a good machine, longer on slower hardware. You don&#8217;t want users waiting that long on first launch.</p><p>Model updates become a nightmare. If you need to fix a bug, you&#8217;ll need to push to every client. If you need to upgrade to a newer version, you&#8217;ll need to update every installation. How about using a different model? You&#8217;ll need to rebuild and redistribute your entire application.</p><p>Client-server architecture solves these problems.</p><ol><li><p>Load the model once on a server with proper GPU resources</p></li><li><p>Clients send images and prompts over HTTP</p></li><li><p>Server runs inference and returns results</p></li><li><p>Clients are lightweight and model-agnostic</p></li></ol><p>This separation means you can upgrade the model, switch to an entirely different architecture, scale to multiple GPUs, or add caching without changing a single line of client code.</p><p>The architecture looks like this </p><pre><code>Client (Python/JS/Mobile) &#8594; API Server (FastAPI) &#8594; Model Server (Magma8B on GPU)</code></pre><p>The API server handles HTTP requests, validates inputs, manages authentication (if needed), and orchestrates inference. The model server focuses solely on loading the model, running inference, and managing GPU memory. Each component has one job and does it well.</p><h3>Separation of Concerns</h3><p>This follows a fundamental software engineering principle. Your API layer handles authentication, rate limiting, and logging without touching model code. Your model code focuses exclusively on inference. Want to swap models? Change one file. Want to add API key validation? Modify only the API layer.</p><p>You see this pattern everywhere in well-designed systems.</p><ul><li><p>Web applications separate frontend, backend, and database layers</p></li><li><p>Microservices split functionality across independent services</p></li><li><p>Operating systems separate kernel space from user space</p></li></ul><h3>Network Boundaries Force Good Design</h3><p>Building across network boundaries actually makes you write better code because it forces constraints.</p><p><strong>You must serialize data:</strong> Can&#8217;t pass Python objects over HTTP. This forces you to think carefully about your data structures and interfaces.</p><p><strong>You must handle errors explicitly:</strong> Networks fail. Requests time out. Servers crash. You can&#8217;t pretend these won&#8217;t happen.</p><p><strong>You must define clear interfaces:</strong> Ambiguous interfaces break quickly over network boundaries. Clear contracts become essential.</p><p><strong>You must think about performance:</strong> Network latency matters. You optimize data transfer and response times.</p><p>These constraints push you toward maintainable, robust code. This is why microservices (despite their complexity) can lead to better system design. The boundaries force discipline.</p><h2>Building the Model Server</h2><p>Let&#8217;s start with proper project structure. Good organization makes debugging easier and helps others (including future you) understand your code.</p><pre><code>magma-api/
&#9500;&#9472;&#9472; server/
&#9474;   &#9500;&#9472;&#9472; __init__.py
&#9474;   &#9500;&#9472;&#9472; main.py          # FastAPI application
&#9474;   &#9500;&#9472;&#9472; model.py         # Model loading and inference
&#9474;   &#9500;&#9472;&#9472; schemas.py       # Pydantic models for validation
&#9474;   &#9492;&#9472;&#9472; config.py        # Configuration management
&#9500;&#9472;&#9472; client/
&#9474;   &#9492;&#9472;&#9472; client.py        # Python client library
&#9500;&#9472;&#9472; requirements.txt
&#9500;&#9472;&#9472; .env                 # Local environment variables (add to .gitignore!)
&#9492;&#9472;&#9472; README.md</code></pre><p>Each file has a single responsibility. The API code lives in <code>main.py</code>, model logic in <code>model.py</code>, validation schemas in <code>schemas.py</code>, and configuration in <code>config.py</code>. This separation makes testing easier and changes more localized.</p><p>Create <code>requirements.txt</code>:</p><pre><code>fastapi==0.104.1
uvicorn[standard]==0.24.0
torch==2.1.0
transformers==4.35.0
Pillow==10.1.0
pydantic==2.5.0
python-multipart==0.0.6</code></pre><p>Quick breakdown of dependencies &#128376;&#65039;</p><ul><li><p><strong>FastAPI and Uvicorn:</strong> Web framework and ASGI server. FastAPI gives you automatic validation, documentation, and excellent async support.</p></li><li><p><strong>PyTorch and Transformers:</strong> Model inference. Transformers make loading Hugging Face models straightforward.</p></li><li><p><strong>Pillow:</strong> Image processing for loading and manipulating images.</p></li><li><p><strong>Pydantic:</strong> Data validation. This will save you from so many bugs.</p></li><li><p><strong>python-multipart:</strong> File upload support for when we extend the API.</p></li></ul><h3>Model Loading and Management</h3><p>Loading a 16GB model takes time, consumes significant memory, and you definitely don&#8217;t want to reload it for every request. That would be painfully slow and waste resources.</p><p>The solution is the Singleton pattern. Load the model once, keep it in memory, reuse it for every request. This is standard practice for serving ML models.</p><p>Create <code>server/model.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-TFc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-TFc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 424w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 848w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 1272w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-TFc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png" width="1456" height="1979" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1979,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:479404,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-TFc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 424w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 848w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 1272w, https://substackcdn.com/image/fetch/$s_!-TFc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aecf70f-073f-4405-b979-8dbb5a1944fa_1616x2196.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Understanding the Singleton Pattern</h4><p>The <code>__new__</code> method controls instance creation in Python. Most classes use the default implementation, but we override it to ensure only one instance ever exists.</p><p>First time you create <code>ModelServer(),</code> Python creates a new instance and stores it in <code>cls._instance</code>. Every subsequent call checks if <code>_instance</code> exists and returns it if so. One model in memory, shared across all requests.</p><p>Why does this matter? If you loaded a new model for every request:</p><ul><li><p>You&#8217;d wait several seconds per request (terrible user experience)</p></li><li><p>You&#8217;d run out of GPU memory immediately (each model copy takes 16GB)</p></li><li><p>You&#8217;d waste compute resources loading the same weights repeatedly</p></li></ul><p>The Singleton pattern is common in production systems for expensive resources:</p><ul><li><p>Database connection pools (one pool, many connections)</p></li><li><p>Configuration managers (one config object, accessed everywhere)</p></li><li><p>Logger instances (one logger per module)</p></li><li><p>Cache managers (one cache instance, shared state)</p></li></ul><h4>Memory Optimization with bfloat16</h4><pre><code>torch_dtype=torch.bfloat16</code></pre><p>This cuts memory usage in half compared to float32. Instead of 32-bit floating point numbers, we use 16-bit Brain Float format.</p><p>&#8220;Wait, doesn&#8217;t that hurt accuracy?&#8221; Not really for inference. Training needs float32 (or even float64 sometimes) for stable gradients across many iterations. Inference runs once through the network, and bfloat16 precision is usually sufficient.</p><p>The memory savings are massive. With bfloat16, Magma8B fits comfortably in 24GB of VRAM. With float32, you&#8217;d need closer to 32GB. This is standard practice in production ML serving.</p><h4>Device Mapping</h4><pre><code>device_map="auto"</code></pre><p>This tells Transformers to automatically distribute the model across available devices. If you have multiple GPUs, it can split layers across them. If your model barely fits in GPU memory, it can offload some layers to CPU memory (slower, but better than crashing).</p><p>For most cases with a single 24GB GPU, this just loads everything onto CUDA device 0.</p><h4>Property Decorator for Clean Access</h4><pre><code>@property
def model(self):
    if self._model is None:
        raise RuntimeError(&#8221;Model not loaded. Call load_model first.&#8221;)
    return self._model</code></pre><p>This creates a clean access pattern with built-in validation. Instead of accessing <code>_model</code> directly, you use <code>model_server.model</code>. If someone tries to use the model before loading it, they get a clear error message:</p><pre><code>RuntimeError: Model not loaded. Call load_model first.</code></pre><p>Much better than:</p><pre><code>AttributeError: &#8216;NoneType&#8217; object has no attribute &#8216;generate&#8217;</code></pre><p>Good error messages save hours of debugging. When something goes wrong, you want to know exactly what and where.</p><h3>Implementing Set-of-Mark Inference</h3><p>Now let&#8217;s add the actual inference logic. This is where Set-of-Mark prompting happens.</p><p>Add to <code>server/model.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kDnj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kDnj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 424w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 848w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 1272w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kDnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png" width="1456" height="3500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3500,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:954055,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kDnj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 424w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 848w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 1272w, https://substackcdn.com/image/fetch/$s_!kDnj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F719dfe67-6aeb-4795-bcee-a6d2332ddb55_1936x4654.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Drawing Visual Marks</h4><p>The marks get drawn directly on the image. This is crucial: the model needs to see them to understand references like &#8220;#1.&#8221; We&#8217;re modifying the visual input, not just adding metadata.</p><pre><code>draw.rectangle([x1, y1, x2, y2], outline=&#8221;red&#8221;, width=3)
draw.text((x1, y1 - 30), label, fill=&#8221;white&#8221;, font=font)</code></pre><p>We draw red bounding boxes with 3-pixel width for visibility. The label goes above the box with a red background, making it stand out even on complex images.</p><p>During training, Magma saw similar visual markers. It learned to associate these visual cues with text references. When you ask &#8220;What is object #1?&#8221;, the model looks for the visual marker labeled &#8220;#1&#8221; and reasons about that specific region!</p><h4>Chat Template Format</h4><p>Magma expects inputs in a specific chat format:</p><pre><code>convs = [
    {&#8221;role&#8221;: &#8220;system&#8221;, &#8220;content&#8221;: &#8220;You are agent that can see, talk and act.&#8221;},
    {&#8221;role&#8221;: &#8220;user&#8221;, &#8220;content&#8221;: &#8220;&lt;image_start&gt;&lt;image&gt;&lt;image_end&gt;\nWhat is #1?&#8221;}
]</code></pre><p>The system message sets the context. The user message includes special tokens <code>&lt;image_start&gt;</code> and <code>&lt;image_end&gt; </code>that tell the model where the image embedding goes. </p><p>The <code>apply_chat_template</code> method formats these messages according to Llama-3&#8217;s conversational structure. Different models have different templates, and Transformers handle this automatically.</p><h4>Tensor Processing</h4><pre><code>inputs = self.processor(images=[image], texts=text, return_tensors=&#8221;pt&#8221;)
inputs = inputs.to(self.model.device)</code></pre><p>The processor handles several transformations:</p><ol><li><p>Resizes the image to the expected dimensions (typically 224x224 or 384x384)</p></li><li><p>Normalizes pixel values to the range the model expects</p></li><li><p>Converts everything to PyTorch tensors</p></li><li><p>Tokenizes the text into token IDs</p></li></ol><p>Moving to the model&#8217;s device is essential. The model lives on GPU, and mixing CPU and GPU tensors causes errors fast.</p><h4>Inference Mode</h4><pre><code>with torch.inference_mode():
    outputs = self.model.generate(...)</code></pre><p>This is better than <code>torch.no_grad()</code> for inference. It disables gradient computation and enables inference-specific optimizations. Always use this for production serving.</p><p>Without this context manager, PyTorch tracks gradients for every operation (in case you want to do backpropagation). This uses significant memory and slows things down. For inference, you never need gradients.</p><h4>Deterministic Generation</h4><pre><code>do_sample=False</code></pre><p>This makes generation deterministic. The model always picks the highest probability token at each step. Same input gives same output.</p><p>For analytical queries like &#8220;What color is object #1?&#8221;, you want consistent results. The answer shouldn&#8217;t randomly vary between &#8220;red&#8221;, &#8220;crimson&#8221;, and &#8220;burgundy&#8221; depending on sampling randomness.</p><p>If you wanted creative or varied outputs (like for creative writing), you&#8217;d set <code>do_sample=True</code> and adjust <code>temperature</code>. But for analytical tasks, deterministic output makes sense.</p><h4>Autoregressive Generation</h4><p>The <code>generate</code> method works autoregressively:</p><ol><li><p>Model generates one token</p></li><li><p>Appends it to the input</p></li><li><p>Generates the next token based on all previous tokens</p></li><li><p>Repeats until reaching <code>max_new_tokens</code> or an end-of-sequence token</p></li></ol><p>This is why generation can be slow for long responses. Each token depends on all previous tokens, so you can&#8217;t parallelize across the sequence length dimension.</p><p>For a 100-token response, the model runs 100 forward passes.</p><h3>Error Handling and Resource Management</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pSzX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pSzX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pSzX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1269297,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pSzX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pSzX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a596ee4-3a76-46fe-9d6d-170175d359e9_5472x3648.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Things will go wrong. GPUs run out of memory. Images are corrupted. Networks time out. Users send invalid inputs. Let&#8217;s handle these gracefully.</p><p>Add to <code>server/model.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!46Ka!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!46Ka!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 424w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 848w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 1272w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!46Ka!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png" width="1456" height="3104" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3104,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:888301,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!46Ka!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 424w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 848w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 1272w, https://substackcdn.com/image/fetch/$s_!46Ka!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb851322e-ae14-4b3c-af7a-0be426ca807c_1834x3910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Custom Exception Hierarchy</h4><p>We define custom exceptions for different error types:</p><ul><li><p><strong>ModelLoadError:</strong> Something went wrong loading the model (missing files, incompatible versions)</p></li><li><p><strong>InferenceError:</strong> Inference failed (OOM, CUDA error, unexpected exceptions)</p></li><li><p><strong>InvalidImageError:</strong> Input validation failed (corrupted image, invalid marks)</p></li></ul><p>This lets you (and your API) distinguish between error types:</p><ul><li><p><strong>ModelLoadError</strong>: Don&#8217;t retry, something is fundamentally wrong</p></li><li><p><strong>InferenceError</strong> from OOM: Retry with smaller inputs</p></li><li><p><strong>InvalidImageError</strong>: Client error, return 400 Bad Request</p></li><li><p><strong>InferenceError</strong> from other causes: Might be transient, could retry</p></li></ul><p>Your API can map these to appropriate HTTP status codes. Your logs become more useful when you can filter by error type and identify patterns.</p><h4>Input Validation</h4><pre><code>if x2 &lt;= x1 or y2 &lt;= y1:
    raise InvalidImageError(&#8221;Mark has invalid dimensions&#8221;)</code></pre><p>We validate marks before attempting inference. A bounding box where x2 &#8804; x1 makes no geometric sense. It would have zero or negative width. Catch this early with a clear error message.</p><p>This is way better than letting bad data propagate into the model and causing weird errors deep in the inference stack. &#8220;Mark has invalid dimensions&#8221; is much clearer than some cryptic error from PIL or PyTorch.</p><h4>Memory Management After OOM</h4><pre><code>except torch.cuda.OutOfMemoryError:
    torch.cuda.empty_cache()
    raise InferenceError(&#8221;GPU out of memory...&#8221;)</code></pre><p>After an OOM error, we call<code> torch.cuda.empty_cache()</code> to release unused GPU memory. This doesn&#8217;t free memory actively in use, but it helps prevent fragmentation.</p><p>Use this sparingly because it has overhead. But after an OOM error, it&#8217;s worth trying to clean up before the next request.</p><blockquote><p><strong>&#9888;&#65039; Important:</strong> Catching CUDA OOM errors is tricky. By the time the exception is raised, your GPU might be in a weird state. In production, you might want to restart the process after multiple consecutive OOM errors. It&#8217;s not elegant, but it&#8217;s reliable.</p></blockquote><p>What I usually do: after three OOM errors in a row, log a critical alert and exit gracefully. A process manager (like <code>systemd</code> or Kubernetes) restarts the service. Brief downtime beats serving incorrect results or crashing unexpectedly.</p><h2>Building the API Layer</h2><p>Now let&#8217;s wrap the model server in a REST API. This is the interface clients actually use.</p><h3>Defining Request and Response Schemas</h3><p>Pydantic models define your API&#8217;s contract. They validate input, generate documentation automatically, and make your life much easier.</p><p>Create <code>server/schemas.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OoLX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OoLX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 424w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 848w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 1272w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OoLX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png" width="1456" height="3248" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3248,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:920761,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OoLX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 424w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 848w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 1272w, https://substackcdn.com/image/fetch/$s_!OoLX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5f8c8b7-78ca-4d51-8a61-5083e3e7900a_1802x4020.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>What Pydantic Gives You</h4><p><strong>Automatic Validation:</strong> Invalid data gets rejected before hitting your code. If someone sends a string where you expect an integer, Pydantic returns a clear error message. You don&#8217;t write validation code, Pydantic handles it.</p><p><strong>Type Safety That&#8217;s Enforced:</strong> Python&#8217;s type hints are suggestions. <code>mypy</code> can check them statically, but at runtime, Python doesn&#8217;t enforce them. Pydantic enforces type constraints at runtime. This catches bugs early.</p><p><strong>Automatic Documentation:</strong> FastAPI uses these schemas to generate interactive API documentation. Your users see exactly what fields are required, what types they accept, what constraints apply, and example requests. All from your Pydantic models. Zero extra work.</p><p><strong>Serialization:</strong> Pydantic handles conversion between Python objects and JSON automatically. You work with nice Python dataclasses, Pydantic handles the wire format.</p><h4>Field Validators</h4><pre><code>@validator(&#8217;x2&#8217;)
def x2_greater_than_x1(cls, v, values):
    if &#8216;x1&#8217; in values and v &lt;= values[&#8217;x1&#8217;]:
        raise ValueError(&#8217;x2 must be greater than x1&#8217;)
    return v</code></pre><p>Validators let you encode business logic beyond basic type checking. The <code>@validator</code> decorator runs custom validation after Pydantic parses the field. Here we ensure bounding boxes have positive dimensions before the request ever reaches the model layer.</p><p>This works together with the validation in <code>ModelServer.generate()</code>. Bad coordinates get caught at the API boundary with a user-friendly error, not deep in the inference stack.</p><h4>Why Base64 for Images?</h4><p>JSON doesn&#8217;t support binary data natively. Base64 encodes binary as ASCII text, making it JSON-safe. The tradeoff is about a 33% size increase (every 3 bytes becomes 4 characters), but you get simplicity and compatibility. </p><p>Alternative approaches exist:</p><p><strong>multipart/form-data</strong>: Send images as file uploads. More efficient (no encoding overhead), but more complex to implement and test.</p><p><strong>Binary protocols</strong>: Use Protocol Buffers or similar. Fastest option, but way more complex and less debuggable.</p><p>For most use cases, Base64 in JSON is fine. If you&#8217;re sending thousands of images per second, consider alternatives. But at that scale, you have bigger architectural concerns anyway.</p><h4>Configuration Management</h4><p>Settings should live in environment variables, not hardcoded in your source. You should be able to change settings between environments without touching code.</p><p>Create <code>server/config.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rUVn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rUVn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 424w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 848w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 1272w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rUVn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png" width="1380" height="1490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1490,&quot;width&quot;:1380,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:279491,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rUVn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 424w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 848w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 1272w, https://substackcdn.com/image/fetch/$s_!rUVn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41e63030-8f93-425c-b8e0-2fff83ad8067_1380x1490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Why External Configuration</h4><p>Different environments need different settings.</p><ul><li><p><strong>Development:</strong> Might use CPU (no GPU), have verbose logging, and permissive CORS</p></li><li><p><strong>Staging:</strong> Might use GPU, have moderate logging, and restricted CORS</p></li><li><p><strong>Production:</strong> Uses GPU, has minimal logging, strict CORS, and monitoring enabled</p></li></ul><p>You change environment variables, not code. Secrets live in environment variables or secret management systems, never in Git!</p><p>This principle comes from the Twelve-Factor App methodology: store config in the environment. It seems obvious once you know it, but violating it causes countless problems.</p><h4>Creating a <code>.env</code> File</h4><p>For local development, create .env in your project root (and add it to <code>.gitignore</code>):</p><pre><code>MODEL_PATH=microsoft/Magma-8B
DEVICE=cuda
API_HOST=0.0.0.0
API_PORT=8000
MAX_IMAGE_SIZE=10485760
LOG_LEVEL=INFO
CORS_ORIGINS=*</code></pre><p>Pydantic loads this automatically. In production, you&#8217;d set these as actual environment variables through Docker, Kubernetes, or your deployment platform.</p><h3>FastAPI Implementation</h3><p>Now let&#8217;s build the actual API. This is where everything comes together.</p><p>Create <code>server/main.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7g3D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7g3D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 424w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 848w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 1272w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7g3D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png" width="1456" height="7466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:7466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2079388,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7g3D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 424w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 848w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 1272w, https://substackcdn.com/image/fetch/$s_!7g3D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9e24b9-fd9e-465d-94c4-b9c149375191_1750x8974.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Understanding the API Components</h4><h5><strong>Lifespan Events</strong></h5><pre><code>@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info(&#8221;Loading model...&#8221;)
    model_server.load_model(settings.MODEL_PATH, settings.DEVICE)
    yield
    logger.info(&#8221;Shutting down...&#8221;)</code></pre><p>This replaced the deprecated <code>@app.on_event(&#8220;startup&#8221;) </code>decorator. Code before <code>yield</code> runs at startup, code after runs at shutdown.</p><p>This ensures:</p><ul><li><p>A model loads before accepting requests (no 500 errors from an uninitialized model)</p></li><li><p>Resources clean up gracefully on shutdown (important for Docker containers)</p></li><li><p>You can run initialization code that might fail (and handle failures appropriately)</p></li></ul><h5>Health vs Readiness Checks</h5><pre><code>@app.get(&#8221;/health&#8221;)
async def health_check():
    return {&#8221;status&#8221;: &#8220;healthy&#8221;}

@app.get(&#8221;/ready&#8221;)
async def readiness_check():
    _ = model_server.model  # Raises if not loaded
    return {&#8221;status&#8221;: &#8220;ready&#8221;, &#8220;model_loaded&#8221;: True}</code></pre><p>These serve different purposes and are essential for production deployments.</p><p><strong>/health:</strong> Is the service running? Returns 200 if the process is alive. Load balancers use this for liveness probes.</p><p><strong>/ready:</strong> Is the service ready to handle requests? Returns 200 only if the model is loaded. Load balancers use this for readiness probes.</p><p>Kubernetes and other orchestrators use these endpoints:</p><ul><li><p>Liveness probe: If this fails repeatedly, restart the container (it&#8217;s dead)</p></li><li><p>Readiness probe: If this fails, stop sending traffic (it&#8217;s not ready yet)</p></li></ul><p>During deployment, traffic only routes to pods that pass readiness checks. This prevents users from hitting a server that&#8217;s still loading its 16GB model, which is essential for zero-downtime deployments.</p><h5><strong>CORS Middleware</strong></h5><pre><code>app.add_middleware(
    CORSMiddleware,
    allow_origins=cors_origins,
    allow_credentials=True,
    allow_methods=[&#8221;*&#8221;],
    allow_headers=[&#8221;*&#8221;],
)</code></pre><p>This allows web browsers to call your API from different domains. Without CORS, browsers block cross-origin requests for security.</p><p>The <code>allow_origins=[&#8220;*&#8221;]</code> setting allows requests from any origin, which is fine for development. In production, restrict this to your actual domains:</p><pre><code>allow_origins=[&#8221;https://myapp.com&#8221;, &#8220;https://www.myapp.com&#8221;]</code></pre><p>If you&#8217;re only doing server-to-server communication, you don&#8217;t need CORS. Browsers enforce CORS policies, servers don&#8217;t care.</p><h5>Base64 Decoding with Validation</h5><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TWKa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TWKa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 424w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 848w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TWKa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png" width="1262" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd5e8219-d822-4e26-9720-03587e277826_1262x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1262,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:230655,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TWKa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 424w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 848w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!TWKa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd5e8219-d822-4e26-9720-03587e277826_1262x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This handles several cases.</p><p><strong>Data URLs:</strong> Browsers often send <code>data:image/png;base64,...</code>. We strip the prefix.</p><p><strong>Size validation:</strong> Reject images larger than <code>MAX_IMAGE_SIZE</code>. Prevents memory exhaustion attacks.</p><p><strong>Image verification:</strong> <code>image.verify()</code> checks if the file is actually a valid image. Catches corrupted files early.</p><p><strong>Reload after verify:</strong> <code>verify()</code> closes the file handle, so we need to reload.</p><p><strong>Color mode normalization: </strong>Convert to RGB. Some images use RGBA (transparency), L (grayscale), or other modes. The model expects consistent RGB input.</p><h5>HTTP Status Codes</h5><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NwVJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NwVJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 424w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 848w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 1272w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NwVJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png" width="1456" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25922664-2314-4433-9266-be9c347eb346_1802x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NwVJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 424w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 848w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 1272w, https://substackcdn.com/image/fetch/$s_!NwVJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25922664-2314-4433-9266-be9c347eb346_1802x558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Different status codes communicate different things.</p><p><strong>400 Bad Request:</strong> Client&#8217;s fault (invalid data, malformed input). Client should fix their request. Don&#8217;t retry the prior request.</p><p><strong>500 Internal Server Error:</strong> Server&#8217;s fault (inference failed, unexpected error). Client can retry, it may or may not work.</p><p><strong>503 Service Unavailable:</strong> Service exists but isn&#8217;t ready (model hasn&#8217;t loaded). Client should wait and retry.</p><p>This distinction matters. Well-behaved clients know:</p><ul><li><p>Don&#8217;t retry 400 errors (the problem is in the request)</p></li><li><p>Do retry 500/503 errors with exponential backoff (it&#8217;s a temporary issue, might resolve)</p></li></ul><p>HTTP status codes are a universal language. Your API speaks the same language as every other HTTP service. Use them correctly.</p><h2>Building the Client</h2><p>Your client&#8217;s job is simple: send images and prompts, and receive responses. Let&#8217;s build a clean Python client.</p><p>Create <code>client/client.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mB7j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mB7j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 424w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 848w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 1272w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mB7j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png" width="1456" height="5388" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:5388,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1566545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mB7j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 424w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 848w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 1272w, https://substackcdn.com/image/fetch/$s_!mB7j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa41e6c80-bc02-47fe-a367-63b8651b9f87_1902x7038.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Key Design Decisions</h3><h4>Session Reuse</h4><pre><code>self.session = requests.Session()</code></pre><p>TCP connections are expensive to create. A <code>Session</code> reuses connections through connection pooling. For multiple requests to the same host, this is significantly faster.</p><p>This is the same principle as database connection pooling. Create expensive resources once, reuse them many times.</p><h4>Timeouts Are Essential</h4><pre><code>timeout=30</code></pre><p>Without a timeout, requests can hang indefinitely. Your program just waits forever, consuming resources.</p><p>30 seconds is reasonable for ML inference. Always set timeout in production. Always!</p><h4>Retry Logic with Exponential Backoff</h4><pre><code>for attempt in range(self.max_retries):
    try:
        # ... make request ...
    except requests.exceptions.Timeout:
        if attempt == self.max_retries - 1:
            raise
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...</code></pre><p>Transient errors (like network blips or temporary server overload) often resolve if you wait and retry. Exponential backoff prevents overwhelming the server with retries.</p><blockquote><p>We retry server errors (5xx) but not client errors (4xx). <strong>If you sent invalid data, retrying won&#8217;t help</strong>.</p></blockquote><h4>Context Manager Support</h4><pre><code>with VisionAPIClient() as client:
    result = client.analyze(&#8221;image.jpg&#8221;, &#8220;What is this?&#8221;)</code></pre><p>The <code>__enter__</code> and <code>__exit__ </code>methods let you use the client as a context manager. This ensures the session closes properly even if an exception occurs.</p><h2>Testing the System</h2><p>Time to see everything work together.</p><h3>Running the Server</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kMLj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kMLj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 424w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 848w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 1272w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kMLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png" width="1328" height="744" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1328,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:134838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kMLj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 424w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 848w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 1272w, https://substackcdn.com/image/fetch/$s_!kMLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02ecee9b-4d4e-425c-9101-1cc1c1a9dfee_1328x744.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You should see:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3biZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3biZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 424w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 848w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 1272w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3biZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png" width="1042" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1042,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3biZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 424w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 848w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 1272w, https://substackcdn.com/image/fetch/$s_!3biZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff694bfe4-94db-4eb6-809a-1cb9bd635a53_1042x632.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>First time running, the model downloads from Hugging Face. This can take a while (remember it&#8217;s 16GB). However, subsequent runs load from cache.</p><h4>Testing with the Client</h4><p>Create <code>test_api.py</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FyM1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FyM1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 424w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 848w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 1272w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FyM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png" width="1456" height="3516" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:998925,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FyM1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 424w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 848w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 1272w, https://substackcdn.com/image/fetch/$s_!FyM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d85591-118f-492d-9d32-16041ed3efd3_1650x3984.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Expected Behavior</h4><p><strong>Test 1:</strong> Model should describe both shapes (a red circle and a blue square).</p><p><strong>Test 2:</strong> Model should focus specifically on the red circle, mentioning it&#8217;s red.</p><p><strong>Test 3:</strong> Model should compare both marked objects, discussing colors and shapes of each.</p><p><strong>Test 4:</strong> Model should reason about spatial relationships between the marked objects.</p><p>If you see reasonable responses to all tests, congratulations &#127881;! Your Set-of-Mark vision API is working.</p><h2>Common Issues and Solutions</h2><h4>&#8220;Model not found&#8221; Error</h4><p>Check logs during startup for the actual error. Verify <code>MODEL_PATH</code> points to a valid model. First download can be slow. Check your Hugging Face cache at <code>~/.cache/huggingface/</code>.</p><p>Try explicitly downloading the model first:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UJTu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UJTu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 424w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 848w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 1272w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UJTu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png" width="1456" height="416" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UJTu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 424w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 848w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 1272w, https://substackcdn.com/image/fetch/$s_!UJTu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58e6e93e-d2d2-4d5c-984a-26e47ea15749_1818x520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>CUDA Out of Memory</h4><p>You need a GPU with at least 24GB VRAM. Check with <code>nvidia-smi</code>.</p><p>If you don&#8217;t have enough VRAM:</p><ul><li><p>Try <code>DEVICE=&#8221;cpu&#8221; </code>(it&#8217;s much slower, but works)</p></li><li><p>Close other GPU-using programs</p></li><li><p>Reduce image size before sending</p></li><li><p>Lower <code>max_tokens</code></p></li></ul><h4>Timeout Errors</h4><p>First request is often slower, so you should increase timeout in client:</p><pre><code>client = VisionAPIClient(timeout=60)</code></pre><p>Reduce <code>max_tokens</code> if responses are very long. Check GPU utilization during requests:</p><pre><code>watch -n 0.5 nvidia-smi</code></pre><p>You should see GPU usage spike during inference.</p><h4>Image Encoding Errors</h4><p>Verify the image file exists and is valid. Check format is supported (JPEG or PNG work, while WebP might cause issues). Try resaving the image in a different format.</p><h4>Model Responses Seem Wrong</h4><p>Check if model is in eval mode (we do this in <code>load_model</code>). Verify model actually loaded by checking the <code>/ready</code> endpoint. Try simpler prompts first to verify basic functionality. Save marked images and visually inspect them.</p><h4>Performance Issues</h4><p>Measure request latency:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZFae!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZFae!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 424w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 848w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZFae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png" width="1456" height="473" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140470,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179921361?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZFae!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 424w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 848w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFae!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd703ebf0-e5d0-41e0-b534-4b2a6adf63d2_1834x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This snippet wraps a request with timing so you can see how long inference takes. First requests typically run slower because of JIT compilation and cache warming. After that, expect 0.5-2 seconds per request on GPU. If you&#8217;re seeing 10+ seconds, check that the GPU is actually being used with <code>nvidia-smi</code> while a request is running.</p><h2>What You Built &#129520;</h2><p>You have a complete production-grade vision API that handles Set-of-Mark prompting. </p><p>The model server uses a Singleton pattern to load Magma8B once and serve requests efficiently. The FastAPI layer validates inputs, handles errors gracefully, and generates documentation automatically. The Python client includes retry logic and connection pooling for reliable communication.</p><p>The patterns here transfer beyond ML serving:</p><ul><li><p><strong>Singleton pattern </strong>for expensive resource management</p></li><li><p><strong>Pipeline pattern</strong> for data transformation (image &#8594; marks &#8594; inference &#8594; response) </p></li><li><p><strong>Separation of concerns</strong> across API, model, and validation layers</p></li><li><p><strong>Fail-fast validation</strong> at system boundaries</p></li><li><p><strong>Custom exception hierarchy</strong> for meaningful error handling</p></li><li><p><strong>Configuration via environment variables</strong></p></li></ul><p>These show up everywhere. Database connection pools use Singleton. ETL pipelines follow the same transformation patterns. Microservices separate concerns across network boundaries. Every production system validates at boundaries and handles errors explicitly.</p><p>What separates this from a simple script or a Jupyter notebook is the infrastructure around the model. There&#8217;s proper error handling so failures don&#8217;t crash the server, health checks so load balancers know when to route traffic, logging so you can debug issues in production, and configuration management so you can change settings without redeploying code. These details compound. </p><h2>Looking Ahead</h2><p>In Part 2, we&#8217;ll tackle production deployment:</p><ul><li><p><strong>Dockerization</strong>: Package everything for consistent deployment</p></li><li><p><strong>Scaling strategies</strong>: Handle concurrent requests, load balancing</p></li><li><p><strong>Monitoring</strong>: Metrics, logging, alerting that actually helps</p></li><li><p><strong>Performance optimization</strong>: Caching, batching, query optimization</p></li><li><p><strong>Security hardening</strong>: Authentication, rate limiting, input sanitization</p></li></ul><p>These tools and techniques will allow you to run your application on any machine!</p><h2>Try It Yourself</h2><p>Before moving on, I encourage you to:</p><ol><li><p>Run the code locally and verify everything works</p></li><li><p>Test with your own images and prompts</p></li><li><p>Experiment with different mark configurations</p></li><li><p>Try various prompts and see how responses change</p></li><li><p>Measure performance with different settings</p></li><li><p>Intentionally break something and see how error handling works</p></li></ol><p>The best way to learn is by doing. Play around with the code. Change things. See what breaks. That&#8217;s how you develop intuition for what works and why.</p><p>You&#8217;re not just serving models anymore. You&#8217;re building systems.</p><p>And hey, if you made it this far, you&#8217;re doing great! Building production ML systems is hard. There&#8217;s a lot to learn. But you&#8217;re on the right path &#127775;.</p><p>See you in Part 2 where we&#8217;ll make this production-ready!</p><p>Happy coding &#128187;</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How to Save $150K/Year Self-Hosting LLMs ]]></title><description><![CDATA[Self-hosting LLMs saves $150,000 per year at scale. But going production-ready the wrong way burns that savings fast. Learn the deployment strategies that protect your budget]]></description><link>https://mlpivot.dev/p/production-ready-systems</link><guid isPermaLink="false">https://mlpivot.dev/p/production-ready-systems</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Sat, 29 Nov 2025 02:27:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ab5f1541-bd81-46ee-813b-a491f59fd9d9_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve spent weeks building your vLLM setup. Tests pass. Monitoring dashboards show green. Your fallback strategy works. Everything runs perfectly on your laptop. Then someone asks the inevitable question: &#8220;Are we ready for production?&#8221;</p><p>That question reveals a significant gap. Running <code>docker-compose up</code> on your machine is one thing. Serving production traffic to users who expect reliability is entirely different. The distance between these two stages is larger than most engineers expect.</p><p>At 5 million requests per month, downtime has real consequences. That&#8217;s roughly 7 requests per second. When vLLM goes down, revenue drops. Users get frustrated. Your phone starts buzzing with alerts. You can&#8217;t experiment with production traffic without risking your reputation and your users&#8217; trust.</p><p>This post bridges the gap. We&#8217;re taking everything from the previous parts of this series and building a production system that handles real user traffic reliably. </p><h2>What We&#8217;ve Built So Far</h2><p>The previous posts in this series established the foundation for a robust system. Part 1 identified problems with notebook code: no type safety, no tests, just pure chaos. Part 2 introduced BAML to add structure and type safety to LLM interactions. Part 3 covered comprehensive testing strategies to catch bugs before production. Part 4 walked through vLLM self-hosting, monitoring, dashboards, and fallback strategies to OpenAI.</p><p>These pieces work well independently, but they&#8217;re still isolated components. Now we need to connect them into a cohesive production system. This means automated deployments that don&#8217;t require manual SSH sessions. Proper deployment strategies that minimize risk. Rollback plans for when things inevitably go wrong. Runbooks so your team knows exactly what to do at 3AM. Operational patterns that prevent most of those 3 AM emergencies in the first place.</p><h2>The Complete Architecture</h2><p>Your production system needs several components working together seamlessly.</p><p>Start with a load balancer that distributes traffic across API instances. This load balancer handles SSL termination so your API services don&#8217;t worry about certificates. It kicks out unhealthy instances automatically so users never hit broken services. It implements rate limiting to prevent abuse and ensures no single user consumes all your resources. You can use AWS Application Load Balancer, nginx, or HAProxy depending on your infrastructure preferences and existing expertise.</p><p>Behind the load balancer sit multiple API service instances. Having multiple instances is critical for reliability. If one instance crashes because of a memory leak or unexpected bug, traffic continues flowing to the healthy instances. Each instance runs your BAML code from Part 2, handles HTTP requests from users, calls vLLM for inference, and falls back to OpenAI when needed. Running three to five instances provides good redundancy without excessive costs.</p><p>The API services connect to vLLM instances for model inference. These are GPU-backed containers running vLLM from Part 4. Each instance operates independently with its own GPU allocation. The load balancer distributes inference requests across them using algorithms like round-robin or least connections. This distribution prevents any single GPU from becoming a bottleneck while others sit idle.</p><p>When vLLM fails, requests automatically fall back to OpenAI. This safety net catches you when self-hosted infrastructure has problems. Part 4 established this fallback pattern using circuit breakers that detect failures and route traffic appropriately. The circuit breaker prevents cascading failures where every request times out waiting for a broken LLM instance.</p><p>Monitoring and logging sit beneath everything, collecting data about system health. Prometheus tracks metrics like request rates, error rates, latency percentiles, and GPU utilization. Grafana visualizes these metrics in dashboards that make system health obvious at a glance. Structured logs capture detailed information about what&#8217;s happening and why. This observability layer is absolutely essential for maintaining sanity in production. You can&#8217;t fix problems you can&#8217;t see.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>Directory Organization</h3><p>Organization matters more than you might expect when working on production systems. A well-structured project helps team members find what they need quickly. New engineers understand the system layout without extensive explanation. Here&#8217;s how to structure your production application.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SxF_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SxF_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 424w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 848w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SxF_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png" width="822" height="1414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:822,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198460,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SxF_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 424w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 848w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!SxF_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd30c1940-94c9-4ee4-8208-a561f37a52d3_822x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <code>api</code> directory contains your main application code, tests from Part 3, and BAML definitions from Part 2. Everything related to the API service lives here. The <code>vllm</code> directory holds vLLM-specific Docker configuration and startup scripts. Keeping this separate from the API makes it easier to update vLLM independently.</p><p>The <code>infrastructure</code> directory stores deployment configurations. Terraform files define your cloud resources declaratively. Kubernetes manifests describe how to deploy your services. Docker Compose production configurations show how pieces fit together. Having all infrastructure-as-code in one place makes environments reproducible and prevents configuration drift.</p><p>The <code>monitoring</code> directory contains Prometheus scraping configurations and Grafana dashboard definitions. Storing dashboards as code means you can version control them and restore them if someone accidentally deletes a dashboard.</p><p>The <code>.github/workflows </code>directory holds CI/CD pipeline definitions. These YAML files automate testing, building, and deploying your application. The <code>scripts</code> directory contains bash scripts for common operations like deploying, rolling back, and health checking. These scripts get used by CI/CD pipelines and can also be run manually when needed.</p><p>The <code>docs</code> directory stores operational documentation. The runbook explains how to handle common issues. The architecture document describes how components interact. Future team members will thank you for thorough documentation.</p><p>This structure separates concerns clearly while keeping related files together. Engineers can navigate the codebase efficiently, which becomes increasingly important as your team grows.</p><h2>Implementing CI/CD</h2><p>Manual deployments create problems that compound over time. Before implementing CI/CD, deployments required SSH access to production servers. You&#8217;d manually stop services, pull new Docker images, restart everything, and hope it worked correctly. Each engineer performed these steps slightly differently. One engineer might restart services in a different order than another. Someone might skip health checks in a rush. There&#8217;s no automated record of what changed or when it changed.</p><p>This approach fails at scale. With five deployments per week, manual processes consume significant engineering time. Each deployment takes 30-45 minutes including coordination, execution, and verification. That&#8217;s 2.5-3.75 hours per week just on deployment mechanics. Documentation for manual processes goes stale quickly because maintaining deployment docs requires discipline most teams lack. New team members take weeks to learn the deployment process through shadowing and trial by error. When something breaks during deployment, debugging is difficult because you don&#8217;t know exactly what changed or in what order.</p><p>CI/CD solves these problems through automation and consistency. Every code change triggers tests automatically. Tests must pass before any deployment proceeds. This catches bugs early before they reach production. Deployments happen exactly the same way every time, eliminating inconsistencies from human variation. You get a complete audit trail showing what changed, when it changed, and who approved it. Most importantly, bugs get caught in CI rather than production. The testing happens before code merges, preventing broken code from ever reaching users.</p><p>The investment in CI/CD pays dividends quickly. After spending a few days setting up pipelines, deployments become routine operations instead of anxiety-inducing events. You can safely deploy ten times per day if needed. New engineers deploy confidently during their first week because the process is automated and documented. When problems occur, the audit trail helps you identify exactly what changed. The consistency means fewer deployment-related incidents because you&#8217;re not introducing human error.</p><h3>The Test Pipeline</h3><p>This pipeline runs on every pull request and prevents broken code from reaching the main branch. Every PR must pass these tests before merging. </p><p>Create this file at <code>.github/workflows/test.yml</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yGEu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yGEu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 424w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 848w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 1272w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yGEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png" width="1228" height="2272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2272,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:361834,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yGEu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 424w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 848w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 1272w, https://substackcdn.com/image/fetch/$s_!yGEu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe376f049-a74c-41b5-bb9e-a6348fff8507_1228x2272.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unit tests run on every pull request because they&#8217;re fast (usually completing in under 2 minutes) and free. They validate your code&#8217;s logic without making external API calls. These tests catch type errors, logic bugs, and regression issues early in the development cycle. Catching bugs during code review is much cheaper than catching them in production.</p><p>Integration tests only run on the main branch because they&#8217;re slower and cost money through OpenAI API calls. Running them on every PR would waste money testing code that might not even merge. Running them on main provides a safety net after merge, confirming the integrated system still works correctly. If integration tests fail on main, you know immediately and can investigate before deployment.</p><p>&#9888;&#65039; Never commit API keys directly to your repository. Use GitHub Secrets to store sensitive credentials. Navigate to Settings, then Secrets and Variables, then Actions to add secrets. GitHub encrypts these secrets and only makes them available to workflows. This prevents credentials from leaking into version control history, which would require key rotation and potential security incident response.</p><p>The coverage report uploads to Codecov, giving you visibility into which code paths your tests exercise. Aim for 80% coverage as a minimum. Lower coverage means untested code paths that might contain bugs. Higher coverage gives you confidence that your tests actually validate the code.</p><h3>The Build Pipeline</h3><p>This pipeline creates Docker images and pushes them to a container registry. Every successful merge to main triggers a build. </p><p>Create this file at<code> .github/workflows/build.yml</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v4CV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v4CV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 424w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 848w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 1272w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v4CV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png" width="1456" height="1891" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1891,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:382295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v4CV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 424w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 848w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 1272w, https://substackcdn.com/image/fetch/$s_!v4CV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b97165a-93b0-4bb4-a84b-79a539eafe74_1520x1974.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Image tagging follows a clear pattern that enables precise version tracking. The main branch gets the <code>latest</code> tag, which always points to the most recent build. Each commit receives a SHA tag like <code>sha-abc123</code>, allowing you to deploy or rollback to any specific commit. Git tags like <code>v1.2.3</code> get semantic version tags following semver conventions. This tagging strategy enables easy rollbacks to specific versions when you need to quickly revert a problematic change.</p><p>The cache configuration dramatically improves build times by reusing layers from previous builds. The first build might take 10 minutes while Docker downloads base images, installs dependencies, and builds your application. Subsequent builds with caching take around 2 minutes because most layers are reused. The<code> cache-from </code>directive tells Docker where to find cached layers. The <code>cache-to</code> directive tells Docker where to store new cached layers. The <code>mode=max</code> option caches as many layers as possible for maximum speed.</p><p>Without caching, your CI/CD pipeline would waste significant time on every build. With caching, builds complete quickly enough that engineers don&#8217;t context switch while waiting. This keeps development velocity high.</p><h3>The Deploy Pipeline</h3><p>This pipeline automatically deploys built images to production after successful builds. It waits for the build to succeed, deploys to your infrastructure, verifies the deployment succeeded, runs smoke tests, and notifies your team. </p><p>Create this file at <code>.github/workflows/deploy.yml</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LMLe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LMLe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 424w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 848w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LMLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png" width="1456" height="2148" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2148,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:416119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LMLe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 424w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 848w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!LMLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52a408a9-b60f-455d-8793-380ffc3088ac_1464x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <code>workflow_run</code> trigger waits for the build to complete successfully before starting deployment. The <code>if</code> condition checks that the build actually succeeded before deploying. This prevents deploying broken images to production.</p><p>The AWS ECS deployment forces a new deployment of service, which pulls the latest image and performs a rolling update. The <code>wait services-stable</code> command blocks until ECS confirms the new tasks are healthy and receiving traffic. This prevents the pipeline from proceeding if the deployment fails.</p><p>Smoke tests verify basic functionality after deployment. These tests should be fast and check critical paths like authentication, database connectivity, and basic API responses. Smoke tests catch obvious problems immediately after deployment while you still have context about what changed.</p><p>The notification step uses <code>if: always()</code> to send Slack messages regardless of success or failure. Success notifications confirm deployments completed successfully. Failure notifications alert the team immediately when problems occur. Having this visibility prevents situations where deployments fail silently and nobody notices until users complain.</p><h3>Pre-commit Hooks</h3><p>Finding problems on your local machine saves time compared to waiting for CI pipelines to run. Pre-commit hooks run formatters, linters, and quick checks locally before you even create a commit.</p><p>Create this file at <code>.pre-commit-config.yaml</code>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MQaU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MQaU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 424w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 848w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MQaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png" width="1178" height="1042" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1042,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:181910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MQaU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 424w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 848w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 1272w, https://substackcdn.com/image/fetch/$s_!MQaU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80512576-9b98-4a53-8012-ab4f280d6fe0_1178x1042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Install pre-commit with <code>pip install pre-commit</code> then run <code>pre-commit install</code> in your repository. Now every commit triggers these checks automatically before the commit completes. If checks fail, the commit gets blocked until you fix the issues. This catches formatting inconsistencies, linting violations, and type errors immediately while the code is fresh in your mind.</p><p>Black formats your Python code consistently. This eliminates arguments about code style in pull requests. Everyone&#8217;s code looks the same regardless of personal preferences. Flake8 checks for common Python programming errors and enforces PEP 8 style guidelines. It catches issues like unused imports, undefined variables, and overly complex functions. Mypy performs static type checking, catching type-related bugs before runtime. These tools work together to maintain code quality without you being in the loop.</p><h2>Deployment Strategies</h2><p>Stopping all services, deploying the new version, and bringing everything back up at once is not the best idea. Production systems need sophisticated deployment strategies that minimize risk and enable quick rollbacks. These strategies determine how new versions roll out to users.</p><h3>Blue-Green Deployment</h3><p>Blue-green deployment maintains two complete production environments running at the same time. </p><p>The blue environment runs your current version and serves all production traffic. The green environment sits idle or runs your new version with no traffic. When you deploy, you update green with the new version while blue continues handling users with no interruption. You run your full test suite against green, verifying everything works correctly. You can even have a small number of  internal users test green manually. The load balancer still points at blue during all of this testing so production users never see the new version until you&#8217;re ready.</p><p>Once you&#8217;ve confirmed green is healthy and behaving correctly, you update the load balancer configuration to route traffic to green instead of blue. This switch happens in seconds at the load balancer level. Users never experience downtime because one environment is always serving traffic. The cutover is nearly instantaneous from a user perspective.</p><p>If anything goes wrong after the switch (maybe you found a bug that tests didn&#8217;t catch), you point the load balancer back to blue. The rollback takes seconds because blue is still running with the old version You haven&#8217;t destroyed anything. </p><p>After running green in production successfully for a few hours or days, you can update blue with the next version and repeat the process. This alternating pattern continues: deploy to the idle environment, test it thoroughly, switch traffic, repeat. Each environment alternates between serving production and waiting as the deployment target.</p><p>Here&#8217;s an example script implementing blue-green deployment &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gk5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gk5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 424w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 848w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gk5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png" width="1456" height="1406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1406,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280993,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gk5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 424w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 848w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!Gk5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfaa4cb1-6219-4d5b-994f-9c3be4d1998a_1464x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Blue-green deployment provides several advantages.</p><p>You get zero downtime deployments because one environment always serves traffic. You can test the new version thoroughly before users see it, catching issues in a production-like environment. Rollback happens instantly by switching the load balancer back. </p><p>The strategy is conceptually simple and easy to understand! Making it beginner friendly to teams new to advanced deployment strategies.</p><p>However, the main disadvantage is cost. You&#8217;re running double the infrastructure during deployments and even between deployments if you keep both environments running. For a system using three API instances and three vLLM instances, blue-green means running twelve instances instead of six. That doubles your compute costs.</p><p>Database migrations become complex because both environments must handle the same schema during the transition period. You can&#8217;t make breaking schema changes without coordination between versions. You also can&#8217;t test at real production scale until after the switch, so some issues only surface when all users hit the new version. </p><p>Use blue-green deployment for critical services where downtime is completely unacceptable.</p><p>Financial systems processing payments, healthcare applications managing patient data, and other high stakes services benefit from this approach. The instant rollback capability makes it worth the expense when every minute of downtime has significant business impact.</p><h3>Canary Deployment</h3><p>Canary deployment gradually shifts traffic to the new version by starting with a small percentage of users and increasing over time. Think of it like testing the water before jumping in. You start with 5% of users seeing the new version. If their experience looks good based on metrics, you increase to 25%, then 50%, then 100%. If anything looks wrong at any stage, you immediately route all traffic back to the old version.</p><p>The name comes from a coal mining practice back in the 20th century. Miners brought canary birds into mines because canaries are highly sensitive to toxic gases like carbon monoxide. If the canary becomes sick or dies, miners knew to evacuate immediately (thank goodness we have sensors now).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCaC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCaC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZCaC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1162842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCaC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 424w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 848w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!ZCaC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad872780-1f0a-4f95-b4c7-b8243d13c077_2048x2048.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In deployment terminology, that small percentage of users acts as the canary. You monitor their experience closely to detect problems before affecting everyone else.</p><p>Here&#8217;s how the deployment process works. You deploy the new version alongside the old version. Initially, 95% of users continue hitting the old version while only 5% hit the new version. You monitor error rates, latency percentiles, and business metrics for both versions. If the canary group shows similar or better metrics than the control group, you increase the percentage. If the canary group shows worse metrics (higher error rates, slower response times, fewer successful transactions), you abort the rollout and route everyone back to the old version.</p><p>This gradual rollout continues in stages. At each stage, you wait a predetermined time (perhaps 10-15 minutes) while monitoring metrics. The waiting period lets issues surface before you expose more users. After confirming metrics look healthy, you proceed to the next stage. The final stage routes 100% of traffic to the new version and removes the old version.</p><p>Here&#8217;s an example bash script that implements a canary deployment &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWl4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 424w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 848w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png" width="1328" height="1712" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1712,&quot;width&quot;:1328,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:335220,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BWl4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 424w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 848w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!BWl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde2b5322-9e71-4fe8-8006-7a71b3ce209b_1328x1712.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can automate canary monitoring with Python.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Md3H!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Md3H!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 424w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 848w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 1272w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Md3H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png" width="1414" height="1750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1750,&quot;width&quot;:1414,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:385024,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Md3H!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 424w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 848w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 1272w, https://substackcdn.com/image/fetch/$s_!Md3H!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe697b5c7-165d-4e03-bb8f-39a10f006753_1414x1750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Canary deployment provides several benefits. It carries lower risk than immediate full rollout because only a small percentage of users see the new version initially. It tests at real production scale with real user behavior patterns, catching issues that staging environments miss. Problems surface early when only 5-10% of users are affected rather than impacting everyone. You don&#8217;t need double infrastructure like blue-green deployment, keeping costs manageable.</p><p>The approach has some disadvantages. The complexity is higher than blue-green because you&#8217;re managing gradual rollout percentages and monitoring at each stage. Some users in the canary group might get a worse experience if the new version has issues, though this affects fewer people than a full rollout would. Good monitoring is absolutely essential because you can&#8217;t detect problems without it. And the deployment process takes longer than a simple cutover.</p><p>Overall the canary deployment provides a good balance between safety and practicality.</p><h3>Rolling Deployment</h3><p>Rolling deployment updates instances one at a time. With four API instances, you update the first instance while the other three continue serving traffic normally. Once the first instance is healthy and passing health checks, you update the second instance. You continue this pattern until all instances run the new version.</p><p>This strategy is the simplest to implement and understand. You start with instance one, update it with the new version, wait for health checks to pass, then move to instance two. The process continues sequentially through all instances. The load balancer automatically routes traffic away from instances during their update and back to them once they&#8217;re healthy.</p><p>You don&#8217;t need additional infrastructure. The deployment happens gradually, giving you opportunities to pause if problems appear. During the rollout, most of your capacity remains on the old version until near the end. If you notice issues during deployment, you can stop the rollout and investigate. Kubernetes actually implements rolling updates by default, making this approach accessible to teams already using Kubernetes.</p><p>The main issue with rolling deployments is version mixing. Both versions run simultaneously during deployment, which can cause problems. Imagine your new version expects a database field that doesn&#8217;t exist yet. Or maybe the new version changes an API response format that the old version can&#8217;t handle. These version compatibility issues require careful consideration.</p><p>Database migrations become tricky when different code versions expect different schemas. You need backward-compatible database changes. </p><p>The deployment is slower than blue-green cutover but faster than canary. </p><p>You can use rolling deployment as your default strategy when starting out. It&#8217;s simple to implement and understand. The gradual rollout provides some safety without requiring complex infrastructure or sophisticated monitoring. Many teams start here and slowly move up to canary deployment after establishing good monitoring.</p><h3>Comparing and Choosing Strategies </h3><p>Your requirements determine which strategy fits best. Let me walk through a decision framework.</p><p>If zero downtime is absolutely required, use blue-green or canary deployment. Rolling deployment might cause brief traffic disruptions during instance restarts, though this depends on your load balancer configuration.</p><p>If you need instant rollback capability, use blue-green deployment. The idle environment provides a perfect rollback target. Canary and rolling deployments require rolling back through your CI/CD pipeline, which takes longer.</p><p>If budget is limited and you want to avoid double infrastructure, use rolling or canary deployment. Blue-green requires maintaining two complete environments, doubling your infrastructure costs during and between deployments.</p><p>If you want to test at real scale before full rollout, use canary deployment. Blue-green tests in a production environment but not at production scale until after the switch. Rolling deployment provides no isolated testing.</p><p>If you want something simple to start with and plan to evolve later, use rolling deployment. It requires minimal infrastructure changes and no sophisticated monitoring initially.</p><h2>Production Readiness Requirements</h2><p>Before launching to production, your system must meet minimum requirements. These items are non-negotiable for production deployment. Skipping them can lead to major problems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o1c-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o1c-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o1c-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15750582,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o1c-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!o1c-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00edec0-0672-4c63-96b8-da1de3e659f0_6720x4480.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Your infrastructure needs at least two instances for redundancy. Single instances create a single point of failure. When that instance crashes due to an unexpected bug or infrastructure issue, your entire service goes down. Users get errors. Revenue stops. Metrics spike. Two instances mean one can fail while the other continues serving traffic. Three to five instances provide even better redundancy and handle traffic distribution more smoothly.</p><p>The load balancer must distribute traffic across these instances properly. Configure health check endpoints that the load balancer queries every few seconds. When an instance fails health checks, the load balancer stops routing traffic to it automatically. This prevents users from hitting broken instances. The load balancer should also handle SSL/TLS termination so your application doesn&#8217;t worry about certificate management. </p><p>Configure SSL/TLS certificates because serving unencrypted traffic in production is unacceptable in modern systems. Users expect HTTPS. Browsers show scary warnings for HTTP sites. Also, search engines penalize sites without HTTPS. Let&#8217;s Encrypt provides free certificates that auto-renew, eliminating cost as an excuse. AWS Certificate Manager also provides free certificates for AWS infrastructure. There&#8217;s no reason to skip this.</p><p>Configure DNS to give your service a domain name instead of requiring users to remember IP addresses. Configure A records pointing to your load balancer. Set up CNAME records for any subdomains. Test DNS resolution from multiple locations. DNS problems are frustrating to debug because of caching at multiple levels.</p><p>Security requires several configurations beyond just HTTPS. Remove all secrets from code completely! Even comments containing old API keys are dangerous. Use environment variables or a secrets management service like AWS Secrets Manager or HashiCorp Vault. These services encrypt secrets at rest and provide audit logs of secret access.</p><p>Configure rate limiting to prevent any single user from consuming all your resources. A misconfigured client making thousands of requests per second shouldn&#8217;t take down your entire service. Rate limiting protects you from both malicious attacks and accidental misuse.</p><p>Track GPU utilization to see whether you&#8217;re using GPU resources efficiently. If GPU utilization sits at 30%, you&#8217;re wasting money. If it constantly hits 100%, you need more GPUs. Request rates show traffic patterns and help you understand usage trends. Error rates indicate problems before users complain. A sudden spike in errors signals something broke. Track these metrics continuously and set up basic alerts &#128200;.</p><p>Create alerts for critical failure conditions. Alert when vLLM crashes completely so you know immediately. Alert when error rates exceed 5% so you can investigate before too many users are affected. Alert when latency exceeds acceptable thresholds so you can scale before users complain about slowness. Start with these basic alerts and add more as you learn what matters for your system.</p><p>Automate your deployment process with a CI/CD pipeline. Manual deployments don&#8217;t scale and introduce human error. The pipeline should run tests, build images, deploy to production, and verify success automatically. Document each step even though it&#8217;s automated. Documentation helps when the pipeline breaks or when you need to run steps manually during debugging.</p><p>Document the rollback procedure in detail and actually test it before you need it in an emergency. Rolling back under pressure is stressful. Having clear documentation and practiced procedures reduces that stress. Test rollbacks in a staging environment regularly to verify they work and to keep the team familiar with the process.</p><p>Implement post-deployment smoke tests that verify basic functionality after each deployment. These tests should be fast and check critical functionality. Can users authenticate? Can the service connect to the database? Do the main API endpoints respond correctly? Smoke tests catch obvious deployment problems immediately while you still have context about what changed.</p><p>Within the first month after launch, add more sophistication to your system. Configure auto-scaling so your infrastructure grows automatically with demand. This prevents manual scaling operations at 3 PM when traffic suddenly spikes. Auto-scaling based on CPU utilization, request rates, or custom metrics keeps your service responsive without constant manual intervention.</p><p>Set up proper network security with firewalls and security groups. Restrict access to internal services. Only expose what needs to be public. Close unnecessary ports. Use private subnets for database and internal services. These security layers provide defense in depth against attacks.</p><p>Track latency at different percentiles to understand user experience. The median (p50) shows typical performance. The 95th percentile (p95) shows what slower users experience. The 99th percentile (p99) shows the worst-case experience. A service with good p50 latency but terrible p99 latency frustrates some users even if most users are happy &#9757;&#127998;.</p><p>Monitor costs so you don&#8217;t get surprised by your cloud bill at month end. Set up billing alerts that notify you when spending exceeds expected amounts. Tag resources appropriately so you can attribute costs to specific services or teams. Review the bill monthly to identify optimization opportunities.</p><p>Create dashboards for all key metrics so anyone can see system health at a glance. The dashboard should show request rates, error rates, latency percentiles, GPU utilization, and fallback rates. Grafana dashboards work well for this. Share dashboard links with your team so everyone knows where to look when investigating issues.</p><p>Add circuit breakers for external dependencies like OpenAI so cascading failures don&#8217;t take down your entire system. When OpenAI is slow or returning errors, the circuit breaker opens and stops sending requests. This prevents your service from waiting indefinitely for responses that will never come.</p><p>Verify the OpenAI fallback actually works under production conditions by deliberately testing failure scenarios. Turn off vLLM in staging and verify traffic routes to OpenAI correctly. Check that the circuit breaker opens appropriately. Test that recovery works when vLLM comes back online.</p><p>Achieve 80% or higher test coverage to catch regression bugs early. Lower coverage means untested code paths that might contain bugs. Higher coverage gives you confidence that your tests actually validate the code. Focus coverage on critical paths first, then expand to edge cases.</p><p>Run load tests at expected production scale to find performance bottlenecks before real users hit them. For example, simulate 7 requests per second (your expected production load) against your staging environment. Monitor response times, error rates, and resource utilization. Fix any problems you discover before launching.</p><p>Operations need documentation for when things inevitably break. Write a runbook covering common issues and their solutions. The runbook should include symptoms, diagnostic steps, and remediation steps for each issue. Update the runbook after every incident with lessons learned.</p><p>Eventually you might add features that provide extra value but aren&#8217;t critical for launch. Multi-region deployment improves latency for global users and provides geographic redundancy. Deploying in US East, US West, and Europe regions means users in each geography get fast response times.</p><p>Distributed tracing helps debug complex issues that span multiple services. Tracing shows how a single request flows through your system, identifying where time is spent and where errors occur. This becomes invaluable as your system grows more complex.</p><p>Don&#8217;t try to implement everything on day one. That leads to scope creep, burnout, and delays to launch. Focus on the essentials first. Add sophistication as you learn what your system actually needs in production. Real usage patterns reveal requirements better than upfront speculation.</p><h2>Creating Effective Runbooks</h2><p>A runbook documents how to operate your system when things break. It&#8217;s like a detailed instruction manual for fixing problems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bj3G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bj3G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bj3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14333421,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179407885?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bj3G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Bj3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e58c719-7840-4bfd-9d42-70ec3653d3aa_7000x3937.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Start with quick links to essential resources at the top of the runbook. Include the Grafana dashboard URL so you can immediately see what metrics look wrong. Include the log aggregation interface (Kibana, CloudWatch, or whatever you use) so you can search for error messages. Include the on-call schedule so you know who to contact for help. </p><p>Document common issues with clear symptoms, diagnostic steps, and remediation steps. Each issue should follow a standard template that makes information easy to find under pressure.</p><h4>Example 1: vLLM Outages</h4><p>For complete vLLM outages, the symptoms include 100% error rate on the monitoring dashboard, all API requests failing with timeout errors, and constant alert notifications flooding your phone. The diagnostic process involves checking the vLLM health endpoint at <code>/health</code> to see if it responds. Run <code>nvidia-smi</code> on the GPU host to verify GPU availability and look for GPU errors. Check that the vLLM container is actually running with <code>docker ps</code> or <code>kubectl get pods</code>. Look at vLLM logs for error messages that explain what went wrong.</p><p>The remediation steps start with restarting vLLM using your container orchestration tool. For Docker Compose, run <code>docker-compose restart vllm</code>. For Kubernetes, run <code>kubectl rollout restart deployment/vllm</code>. Verify GPUs are available and accessible after restart. Check that the circuit breaker routes traffic to OpenAI while vLLM is down so users aren&#8217;t completely blocked. If the OpenAI fallback also fails, escalate to senior engineers or management because you have a bigger problem than just vLLM.</p><h4>Example 2: Slow Performance</h4><p>For slow performance where the system is working but users complain about speed, symptoms include p99 latency exceeding 3 seconds on dashboards, user complaints in support channels, and requests timing out occasionally. Diagnose by checking GPU utilization with <code>nvidia-smi</code>. If GPU utilization is constantly at 100%, you don&#8217;t have enough GPU capacity. Examine queue depth in metrics to see how many requests are waiting. If queue depth keeps growing, requests are arriving faster than you can process them. Look for memory leaks where memory usage creeps up over time on host metrics.</p><p>Fix high GPU utilization by scaling horizontally with additional instances. For Kubernetes, run <code>kubectl scale deployment/vllm --replicas=5</code> to add more capacity. For AWS ECS, update the service&#8217;s desired count. Address large queue depths the same way because more instances provide more processing capacity. If you suspect a memory leak (memory usage growing steadily without returning), restart vLLM to clear memory temporarily, then file a bug report and monitor for recurrence.</p><h4>Example 3: OOM Errors</h4><p>Out of memory errors cause vLLM to crash with &#8220;CUDA out of memory&#8221; messages in logs. This usually happens when requests are too large for available GPU memory. The immediate fix involves lowering the <code>max-model-len</code> configuration parameter in your vLLM startup script. Reduce <code>gpu-memory-utilization</code> to leave more buffer for large requests. Restart vLLM to apply configuration changes. If problems persist, consider upgrading to a larger GPU instance type like AWS p4d instances that offer more memory.</p><h4>Example 4: Unreliable Service</h4><p>High OpenAI costs indicate your fallback is being used too frequently, which means vLLM is unreliable. Check the fallback rate metric. Investigate why vLLM is unstable by looking at instance health, error logs, and recent deployments. Examine the circuit breaker state to see if it opened due to detected problems. Fix the underlying vLLM issues causing the high fallback rate. Reset the circuit breaker if needed after fixing the root cause.</p><h2>Looking Forward: Next Steps</h2><p>After implementing everything in this series, you have a complete production system. You&#8217;ve built type-safe LLM applications using BAML. The test coverage from Part 3 catches regression bugs before they reach users. The self-hosted vLLM infrastructure from Part 4 reduces costs when the economics make sense. Monitoring and alerts tell you what&#8217;s happening. CI/CD pipelines deploy consistently without manual steps. Runbooks guide your team through incidents. You can deploy confidently because the infrastructure and processes support production traffic reliably.</p><p>At 5 million requests per month, the economics clearly favor self-hosting. Using OpenAI API exclusively costs $13,750 per month or $165,000 annually. Self-hosting costs $727 per month for infrastructure ($8,724 annually) plus roughly $4,000-5,000 monthly in engineering time for maintenance and incidents. Net savings remain around $150,000 per year even after accounting for engineering overhead.</p><p>However, self-hosting doesn&#8217;t make sense at all volumes. Below 200,000 requests monthly, just use the API. The engineering overhead isn&#8217;t justified by the cost savings. Between 200,000 and 500,000 requests, the API is probably still the right choice depending on your team&#8217;s capabilities and available engineering time. Above 500,000 requests monthly, self-hosting starts making financial sense. Above 5 million requests, self-hosting is definitely worth the operational complexity!</p><p>As your system grows, you&#8217;ll need more advanced infrastructure. Deploy vLLM in multiple AWS regions (US East, US West, Europe) and route users to the closest one. This cuts latency for global users but requires more complex traffic routing and data synchronization.</p><p>Kubernetes handles more complex orchestration than Docker Compose. You get horizontal pod autoscalers, native rolling updates, and granular resource management. The tradeoff is operational complexity. Learning and maintaining Kubernetes takes time, but it&#8217;s worth it once your system reaches a certain scale.</p><p>Model fine-tuning becomes relevant when open source models aren&#8217;t quite good enough for your specific use case. Fine-tuning requires ML expertise to design training approaches, GPU time for training runs, and data preparation infrastructure. However, fine-tuned models can dramatically improve quality for domain-specific applications. The investment pays off when model quality directly impacts user satisfaction or business metrics.</p><p>The LLM space moves fast. New models drop frequently. vLLM releases updates every few weeks. Stay current by following the vLLM GitHub for releases, checking HuggingFace for new models, and joining communities like r/LocalLLaMA where people share what actually works in production.</p><p>Upgrade thoughtfully. Test new vLLM versions in staging every 2-3 months before rolling them to production. Try new models when the quality improvement justifies the migration risk. Upgrade infrastructure when the benefits clearly outweigh the cost and complexity of migration </p><h2>Conclusion</h2><p>This series covered everything you need to run LLMs in production. Type-safe code with BAML. Tests that catch bugs before users do. Monitoring that tells you what&#8217;s happening. CI/CD that deploys consistently. Deployment strategies that limit risk.</p><p>Building production systems is hard work. It requires infrastructure knowledge, software skills, and operational discipline. But there&#8217;s real satisfaction in building something that works reliably at scale.</p><p>Happy building &#128296;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Self-Hosting LLMs with vLLM]]></title><description><![CDATA[Get ready to save some money &#128176;. In this post, you'll learn how to set up your own LLM server using vLLM, choose the right models, and build an architecture that fits your use case.]]></description><link>https://mlpivot.dev/p/self-hosting-llms-with-vllm</link><guid isPermaLink="false">https://mlpivot.dev/p/self-hosting-llms-with-vllm</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Tue, 25 Nov 2025 22:42:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/76c54188-f8c8-4dfb-a8cc-b36dc0ce55e3_2048x2048.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Three months into production, I opened my OpenAI dashboard. $550 for the month. Not terrible, but we were only handling 200K requests. Our product manager had just shown me the growth projections: 2M requests by Q2, 5M by Q3.</p><p>Quick math: at $0.00275 per request (gpt-4o), that&#8217;s $5.5K/month by Q2, scaling to $13.75K/month by Q3. $165K annually on API calls alone.</p><p>My manager asked: &#8220;Can we run this ourselves?&#8221;</p><p>My first thought: &#8220;Self-hosting sounds expensive and complicated.&#8221;</p><p>Turns out I was wrong about the expensive part. At our scale, self-hosting would cost about $700/month. That&#8217;s over $156K in annual savings at our projected scale.</p><h2>When Self-Hosting Makes Sense (And When It Doesn&#8217;t)</h2><p>Let&#8217;s do the math first.</p><h3>The Cost Breakdown</h3><h4>OpenAI API costs (GPT-4o):</h4><p><em>Assuming ~700 input tokens + 100 output tokens per request.</em></p><h5>Cost per request calculation:</h5><ul><li><p>Input 700 tokens &#215; $0.0025/1K = $0.00175</p></li><li><p>Output: 100 tokens &#215; $0.010/1K = $0.001</p></li><li><p>Total: ~$0.00275 per request</p></li></ul><blockquote><p><strong>&#128221; Note:</strong> Your token counts will vary based on your specific use case.</p></blockquote><ul><li><p>Cost per request: ~$0.00275</p></li><li><p>100K requests/month: $275</p></li><li><p>500K requests/month: $1,375</p></li><li><p>1M requests/month: $2,750</p></li><li><p>5M requests/month: $13,750</p></li></ul><h4>Self-hosted vLLM costs (1-Year Reserved Instance):</h4><p><em>Assuming 24/7 usage.</em></p><ul><li><p>AWS g5.xlarge (1-Year RI): $0.634/hour &#215; 720 hours = $456.48/month</p></li><li><p>storage: $50/month</p></li><li><p>Network transfer: $20/month</p></li><li><p>Infrastructure Total: ~$527/month</p></li></ul><blockquote><p><strong>&#128221; Note:</strong> On-Demand pricing is $1.006/hour ($724/month), so the 1-year RI saves you $267/month compared to on-demand.</p></blockquote><p>Add maybe $200/month for engineering time (monitoring, maintenance). You&#8217;re looking at $727/month total.</p><p>Break-even point is ~264K requests/month.</p><p>Below that? Stick with the API. Above that? Self-hosting saves you money.</p><h3>When to Self-Host</h3><h4>High Request Volume</h4><p>With self-hosting, the costs don&#8217;t scale linearly like API costs do. With OpenAI, if you double your requests, you double your bill. But with self-hosting, your infrastructure costs stay roughly the same whether you&#8217;re processing 300K requests or 3M requests per month (until you need to scale horizontally).</p><p>Let&#8217;s walk through what this actually looks like at different volumes.</p><h5>At 500K requests/month</h5><p><strong>OpenAI API Cost</strong></p><ul><li><p>500,000 requests &#215; $0.00275 per request = $1,375/month</p></li><li><p>Annual cost: $1,375 &#215; 12 = $16,500/year</p></li></ul><p><strong>Self-Hosted Cost</strong></p><ul><li><p>Infrastructure: $527/month (this stays constant)</p></li><li><p>Engineering time: $200/month (this stays constant)</p></li><li><p>Total: $727/month</p></li><li><p>Annual cost: $727 &#215; 12 = $8,724/year</p></li></ul><p>Annual savings: $16,500 - $8,724 = <strong>$7,776/year</strong></p><p>Not bad, but is it worth the operational complexity? Maybe, maybe not.</p><h5>At 2M <strong>requests/mo</strong>nth (projected Q2 volume)</h5><p><strong>OpenAI API Cost</strong></p><ul><li><p>2,000,000 requests &#215; $0.00275 per request = $5,500/month</p></li><li><p>Annual cost: $5,500 &#215; 12 = $66,000/year</p></li></ul><p><strong>Self-Hosted Cost</strong></p><ul><li><p>Infrastructure: $527/month (still the same!)</p></li><li><p>Engineering time: $200/month (still the same!)</p></li><li><p>Total: $727/month</p></li><li><p>Annual cost: $727 &#215; 12 = $8,724/year</p></li></ul><p>Annual savings: $66,000 - $8,724 = <strong>$57,276/year</strong></p><p>Now we&#8217;re talking. That&#8217;s a junior engineer&#8217;s salary you&#8217;re saving!</p><h5>At 5M requests/month (projected Q3 volume)</h5><p><strong>OpenAI API Cost</strong></p><ul><li><p>5,000,000 requests &#215; $0.00275 per request = $13,750/month</p></li><li><p>Annual cost: $13,750 &#215; 12 = $165,000/year</p></li></ul><p><strong>Self-Hosted Cost</strong></p><ul><li><p>Infrastructure: $527/month (yep, still the same)</p></li><li><p>Engineering time: $200/month (still the same)</p></li><li><p>Total: $727/month</p></li><li><p>Annual cost: $727 &#215; 12 = $8,724/year</p></li></ul><p>Annual savings: $165,000 - $8,724 = <strong>$156,276/year</strong></p><p>That&#8217;s not a typo. You&#8217;re saving $156K per year at that scale.</p><h5>Why Does this Matter?</h5><p>Notice how our self-hosted costs stayed at $727/month across all three scenarios. That&#8217;s important! A single g5.xlarge instance can handle way more than 5M requests per month (with proper batching and optimization). Your costs are fixed until you need to scale horizontally.</p><p>Meanwhile, OpenAI&#8217;s costs scale perfectly with your usage. Great when you&#8217;re small, painful when you&#8217;re big.</p><p>This is why I say self-hosting becomes economical around 500K+ requests monthly. Below that, the savings aren&#8217;t dramatic enough to justify the operational overhead. Above that, the savings accelerate fast.</p><p>At our current trajectory (200K requests now, 5M by Q3), we&#8217;d be leaving $156K on the table by staying with the API. That pays for a lot of engineering time.</p><p><strong>Rule of thumb:</strong> If your monthly request volume varies by more than 50%, you&#8217;ll want to either use on-demand instances (which costs more but scales with usage) or calculate your average monthly volume across the year.</p><h4>Data Privacy Requirements and Custom Models</h4><p>Let&#8217;s say you work at a healthcare company building an AI assistant that helps doctors write patient notes. Your training data includes thousands of real patient records with medical jargon, treatment patterns, and clinical language specific to your hospital system.</p><p>Unfortunately, you can&#8217;t upload that data to OpenAI&#8217;s servers to fine-tune GPT-4. Why? Because it&#8217;s protected health information under HIPAA. Sending patient data to a third party (even encrypted) violates compliance requirements. Your legal team would have a meltdown.</p><p>So what do you do? You need to fine-tune a model on your proprietary data, but that data legally cannot leave your infrastructure. The only solution is to download an open source model (like Llama-2 or Mistral), fine-tune it on your own servers with your own data, and run it entirely within your infrastructure. Self-hosting isn&#8217;t optional here. It&#8217;s the only way to do this legally.</p><h5>Scenarios Where This Matters</h5><p><strong>Healthcare:</strong> You need a model that understands your hospital&#8217;s specific terminology, drug protocols, and documentation style. Training data includes patient records that must stay within your HIPAA-compliant infrastructure.</p><p><strong>Financial Services:</strong> You&#8217;re building a model to detect fraudulent transactions based on your bank&#8217;s historical transaction patterns. That data includes customer financial information that can&#8217;t be sent to external APIs under PCI DSS regulations.</p><p><strong>Legal Firms:</strong> You want a model trained on your firm&#8217;s past case documents, legal strategies, and client communications. Attorney-client privilege means this data absolutely cannot leave your servers.</p><p><strong>Manufacturing:</strong> You&#8217;re creating a model to predict equipment failures based on proprietary sensor data from your factory floor. Your competitors would love to see that data. It&#8217;s a trade secret that stays internal.</p><p><strong>Specialized Models Not Available via API:</strong> Sometimes the model you need just doesn&#8217;t exist as an API. In these cases, you&#8217;ll often find an open source base model and fine-tune it yourself. But once you&#8217;ve done that work, there&#8217;s no API to call. The model only exists on your infrastructure. You have to self-host it.</p><h4>Latency-Sensitive Applications</h4><p>When you call the OpenAI API, your request travels across the internet to OpenAI&#8217;s servers, waits in the queue, gets processed, and then the response travels back to you. The entire journey (what we call the &#8220;roundtrip&#8221;) takes 1-2 seconds on average.</p><p>With local inference using vLLM, your request stays inside your own infrastructure. No network hops, no external queues. The model responds in 200-500ms (that&#8217;s 0.2 to 0.5 seconds).</p><p>That might not sound like a huge difference, but think about typing in a chatbot. If every response takes 2 seconds, the conversation feels sluggish. Users notice. They get impatient. They leave.</p><p>But when responses come back in 300ms? The conversation flows naturally. It feels responsive. Users stay engaged.</p><p>This matters most for:</p><ul><li><p>Real-time chat applications where users expect instant responses</p></li><li><p>Code autocomplete (imagine waiting 2 seconds every time you hit tab)</p></li><li><p>Interactive demos or customer support bots</p></li><li><p>Any feature where latency directly impacts user experience</p></li></ul><p>If you&#8217;re building a batch processing system that analyzes resumes overnight, that extra 1-2 seconds per request doesn&#8217;t matter. But if you&#8217;re building a feature where users are actively waiting for responses? Every millisecond counts.</p><p><strong>Latency test:</strong> Before committing to self-hosting for latency reasons, measure your current p95 and p99 latencies. Then ask: what would a 3-5x improvement actually change for your users? If the answer is &#8220;not much,&#8221; stick with the API.</p><h3>When NOT to Self-Host</h3><h4>Low Volume</h4><p>At 200K requests/month, you&#8217;re spending $550/month with the OpenAI API. Self-hosting costs $727/month. You&#8217;d be losing $177/month by self-hosting.</p><p>The break-even point is around 264K requests/month. Below that, the API is literally cheaper.</p><h5>No Upfront Investment:</h5><p>With an API, you create an account, get an API key, and start making requests in under 10 minutes.</p><p>With self-hosting, you need to provision a GPU instance, download models, write Docker configs, set up monitoring. That&#8217;s days of engineering work before your first request. And you&#8217;re paying for that GPU the entire time.</p><h5>Pay-as-you-go:</h5><p>With the API, the provider handles model updates, infrastructure scaling, security patches, and monitoring. While your team focuses on building features.</p><p>With self-hosting, your team handles GPU monitoring, service restarts, out-of-memory errors, version updates, security patches, and on-call rotations when things break.</p><p>Start with the API then switch to self-hosting when you&#8217;ve hit your break-even point.</p><h4>Early Prototyping</h4><p>When you&#8217;re building a new feature, you don&#8217;t know if it&#8217;ll work and you need to test fast.</p><p>With the OpenAI API, you&#8217;re testing your idea in 10 minutes:</p><pre><code>import openai

response = openai.ChatCompletion.create(
    model=&#8221;gpt-4&#8221;,
    messages=[{&#8221;role&#8221;: &#8220;user&#8221;, &#8220;content&#8221;: &#8220;Your prompt here&#8221;}]
)</code></pre><p>Done. No infrastructure setup, no model downloads, no GPU provisioning. Just an API key is needed.</p><h4>Need Cutting-Edge Performance</h4><p>Open source models lag 6-12 months behind frontier models. When GPT-5 launched, the best open source models are still catching up to GPT-4&#8217;s capabilities from 2023. If you need the absolute best performance, stick with APIs.</p><h2>What is vLLM?</h2><p>Before vLLM, people used tools like Ollama for local LLM development. Ollama is great for tinkering on your laptop. Easy setup, good for learning. But Ollama has limitations for production: memory inefficient, static batching, throughput plateaus under load.</p><p>vLLM rethought the architecture from scratch. It&#8217;s a high-performance inference engine built by researchers at UC Berkeley. Used in production by Databricks, Anthropic, and others.</p><h4>The Core Problem vLLM Solves</h4><p>Most LLM servers waste 60-80% of available GPU memory through pre-allocation. </p><p>Traditional servers look at the maximum possible memory a request could need (let&#8217;s say 8GB) and reserve the entire 8GB upfront, even though most requests only use 2-3GB. The other 5-6GB sits there unused and locked. With a 24GB GPU, you can only handle 3 concurrent requests because you&#8217;ve pre-allocated everything.</p><p>vLLM allocates memory dynamically in small blocks (similar to virtual memory in operating systems). It gives each request exactly what it needs, when it needs it. That same 24GB GPU now handles 12 concurrent requests instead of 3.</p><p>You can now serve 2-4x more requests on the same hardware.</p><h3>vLLM&#8217;s Key Innovation</h3><h5>PagedAttention</h5><p>Inspired by virtual memory in operating systems. Instead of allocating huge chunks of memory upfront, it breaks attention computation into small pages and blocks. This lets vLLM allocate memory dynamically as requests actually need it.</p><h5>Continuous Batching</h5><p>Traditional batching waits for a batch to fill up (say, 10 requests), processes them all together, then waits for the slowest request to finish before starting the next batch. Think of it like standard college admissions, where the next cycle doesn&#8217;t start until every request from the current cycle has been processed.</p><p>Continuous batching processes multiple requests together, but as soon as one request finishes generating its response, vLLM immediately adds a new request to fill that spot (think of it like rolling admissions). The batch size stays constant, but requests flow in and out continuously. </p><p>The result is higher throughput and lower latency because you&#8217;re not wasting GPU cycles waiting for every request to finish.</p><h5>Optimized CUDA Kernels</h5><p>When you run PyTorch on a GPU, it uses generic operations that work for any neural network. These operations are flexible, but flexibility comes at a cost. Remember that saying: jack of all trades, master of none.</p><p>vLLM&#8217;s team wrote custom GPU code (CUDA kernels) specifically optimized for running LLM inference. Instead of using PyTorch&#8217;s general-purpose matrix multiplications, they wrote code that does exactly what LLMs need and nothing more.</p><p>As a result, those custom kernels run 2-3x faster than PyTorch&#8217;s generic operations.</p><p>What does this mean for you? Let&#8217;s say you&#8217;re running a 7B model on a g5.xlarge instance. With generic PyTorch operations, you might process 20 tokens per second. With vLLM&#8217;s optimized kernels, you&#8217;re processing 40-60 tokens per second on the same hardware.</p><p>What&#8217;s nice is that you don&#8217;t need to understand CUDA or GPU programming to benefit from this. vLLM handles all the optimization automatically. You just call the API and get faster inference.</p><h2>Getting Started with vLLM</h2><p>Let&#8217;s get vLLM running locally first. Then we&#8217;ll connect it to the BAML code from Part 2.</p><h3>Installation</h3><h4>System Requirements</h4><ul><li><p>Linux (Ubuntu recommended)</p></li><li><p>Python 3.8+</p></li><li><p>NVIDIA GPU with CUDA support (or CPU for testing, but it&#8217;ll be slow)</p></li><li><p>At least 16GB RAM</p></li></ul><blockquote><p><strong>&#128221; Note:</strong> You can run this on a laptop with a decent GPU, but for production you&#8217;ll want a cloud GPU instance.</p></blockquote><pre><code># Install vLLM
pip install vllm

# Verify installation
python -c &#8220;import vllm; print(vllm.__version__)&#8221;</code></pre><h3>Running Your First Model</h3><p>Start with something small for testing. I&#8217;ll use Mistral-7B-Instruct, which is a popular open source model that works well for most tasks.</p><p>You can find models on the HuggingFace model hub at <a href="http://huggingface.co/models">huggingface.co/models</a>. HuggingFace has become the de facto repository for open source LLMs. HuggingFace is kind of like GitHub but for AI models.</p><p>When you see a model name like <em>mistralai/Mistral-7B-Instruct-v0.3</em>, here&#8217;s what the naming convention means: </p><ul><li><p>mistralai is the organization that created the model</p></li><li><p>Mistral-7B-Instruct is the name of the model (7B refers to the number of parameters)</p></li><li><p>v0.3 is the version number (newer versions usually fix bugs or improve performance)</p></li></ul><p>One thing to keep in mind: a 7B parameter model takes about 14GB of disk space. Make sure you have enough storage before downloading. The first time you run vLLM with a model, it&#8217;ll download automatically, which can take 5-10 minutes depending on your internet connection.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nppf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nppf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 424w, https://substackcdn.com/image/fetch/$s_!nppf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 848w, https://substackcdn.com/image/fetch/$s_!nppf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!nppf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nppf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png" width="1210" height="1192" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1192,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nppf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 424w, https://substackcdn.com/image/fetch/$s_!nppf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 848w, https://substackcdn.com/image/fetch/$s_!nppf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!nppf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdea31b1a-a1c7-4985-afe7-66e5a36efcec_1210x1192.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h5>Understanding the Code</h5><ul><li><p><code>LLM() </code>loads the model into GPU memory</p></li><li><p><code>SamplingParams()</code> controls generation (temperature, max length, etc.)</p></li><li><p><code>.generate()</code> runs inference</p></li></ul><h3>Running as an API Server</h3><p>vLLM can run as an OpenAI-compatible API server. This is a game changer.</p><p>If you&#8217;re currently using OpenAI&#8217;s API, your code probably looks something like this: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EPgS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EPgS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 424w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 848w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 1272w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EPgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png" width="1446" height="596" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:596,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113795,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EPgS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 424w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 848w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 1272w, https://substackcdn.com/image/fetch/$s_!EPgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F440be915-890f-4dfa-bf50-f023edcc0f9c_1446x596.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With vLLM running as an API server, you change one line:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-si6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-si6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 424w, https://substackcdn.com/image/fetch/$s_!-si6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 848w, https://substackcdn.com/image/fetch/$s_!-si6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 1272w, https://substackcdn.com/image/fetch/$s_!-si6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-si6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png" width="1456" height="637" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147860,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-si6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 424w, https://substackcdn.com/image/fetch/$s_!-si6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 848w, https://substackcdn.com/image/fetch/$s_!-si6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 1272w, https://substackcdn.com/image/fetch/$s_!-si6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ef0b74-b2be-45dd-9c6b-ef2af0987039_1532x670.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That&#8217;s it. Your application code stays the same. You&#8217;re not rewriting your entire codebase. You&#8217;re not learning a new API. You just point your existing OpenAI client to a different URL.</p><p>This means you can:</p><ul><li><p>Test vLLM locally before committing to infrastructure</p></li><li><p>Switch between OpenAI and self-hosted models by changing one environment variable</p></li><li><p>Use the same codebase for both your local dev environment (vLLM) and production (OpenAI or vLLM)</p></li><li><p>Gradually migrate from OpenAI to self-hosted without a big rewrite</p></li></ul><p>The OpenAI SDK has become the de facto standard for LLM APIs. By being compatible with it, vLLM lets you experiment with self-hosting without throwing away all your existing code.</p><h3>Connecting BAML to vLLM</h3><p>This is where all that abstraction work from Part 2 pays off. Remember when we set up BAML to use different LLM providers without changing our application logic? This is exactly why we did that.</p><p>We&#8217;re going to swap out OpenAI for vLLM by changing exactly one thing: the client configuration. Your application code, your prompts, your type definitions - all of that stays exactly the same. Zero changes to your business logic!</p><h5>Update Your BAML Client</h5><p>Open your BAML configuration file (usually <code>baml_src/clients.baml</code>) and add a new client for vLLM.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DD2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DD2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 424w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 848w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 1272w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DD2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png" width="880" height="392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:880,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62175,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DD2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 424w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 848w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 1272w, https://substackcdn.com/image/fetch/$s_!DD2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6785bc53-8b5a-484a-a72f-a26ec4f471b9_880x392.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A few things to note here.</p><p>The <code>provider</code> is still <code>openai</code> because vLLM implements the OpenAI API spec. This is what makes the whole thing work - vLLM speaks the same language as OpenAI&#8217;s API.</p><p>The <code>base_url</code> points to your local vLLM server. If you&#8217;re running vLLM on a different machine or port, change this URL accordingly.</p><p>The <code>api_key</code> can be anything. vLLM doesn&#8217;t check API keys when running locally. I just put <strong>not-needed-for-local</strong> to make it explicit.</p><p>The <code>model</code> parameter needs to match exactly what you passed to vLLM when you started the server. If you started vLLM with -<code>-model mistralai/Mistral-7B-Instruct-v0.3</code>, use that exact string here.</p><h5>Update Your Function to Use the New Client</h5><p>Now find the function you want to switch over. In Part 2, we created a resume analyzer. Here&#8217;s how you change it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RIBi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RIBi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 424w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 848w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 1272w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RIBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png" width="1224" height="684" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:684,&quot;width&quot;:1224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RIBi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 424w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 848w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 1272w, https://substackcdn.com/image/fetch/$s_!RIBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97639bc7-5929-4da4-9abb-dbf80d9b0168_1224x684.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nXGZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nXGZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 424w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 848w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 1272w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nXGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png" width="1062" height="714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:714,&quot;width&quot;:1062,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:111377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nXGZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 424w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 848w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 1272w, https://substackcdn.com/image/fetch/$s_!nXGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58468cf6-0fa5-4105-be74-8e2edc4b4385_1062x714.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Everything else remains the same except for that one line.</p><h5>Test It Works</h5><p>Now let&#8217;s verify that vLLM is actually being used and returning valid results by creating a test file.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4asI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4asI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 424w, https://substackcdn.com/image/fetch/$s_!4asI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 848w, https://substackcdn.com/image/fetch/$s_!4asI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 1272w, https://substackcdn.com/image/fetch/$s_!4asI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4asI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png" width="1210" height="968" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4asI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 424w, https://substackcdn.com/image/fetch/$s_!4asI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 848w, https://substackcdn.com/image/fetch/$s_!4asI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 1272w, https://substackcdn.com/image/fetch/$s_!4asI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4eb02df-4723-4c6b-ad0d-33529420d13e_1210x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Run the script:</p><pre><code>python test_vllm.py</code></pre><p>You should see output like:</p><pre><code>Name: John Doe
Experience: 5
Skills: [&#8217;Python&#8217;, &#8216;JavaScript&#8217;, &#8216;Docker&#8217;]</code></pre><p>If you get an error like <strong>Connection refused</strong> or <strong>Failed to connect</strong>, check that your vLLM server is actually running on port 8000. Go back to the terminal where you started vLLM and make sure it says &#8220;Running on http://0.0.0.0:8000&#8221;</p><h5>This is the Payoff</h5><p>Let&#8217;s take a step back and appreciate what just happened. You switched from a $0.00275/request API (GPT-4) to a self-hosted model that costs $0.000088/request (assuming our $727/month infrastructure costs and 2M requests). That&#8217;s a 97% cost reduction.</p><p>And how much code did you change? One line.</p><p>This is why we built that abstraction in Part 2. Without BAML, you&#8217;d be rewriting your entire application - updating every API call, changing response parsing logic, handling different error formats, rewriting tests. Instead, you changed client <code>GPT4</code> to client <code>LocalVLLM</code>.</p><p>You can now:</p><ul><li><p>Test locally with vLLM during development</p></li><li><p>Switch to OpenAI for staging/testing if you want higher quality</p></li><li><p>Use different models for different functions (cheap models for simple tasks, expensive models for complex ones)</p></li><li><p>Implement fallback logic (try vLLM first, fall back to OpenAI if it fails)</p></li></ul><p>All without touching your application code. Just configuration changes.</p><p>That&#8217;s the power of abstraction!</p><h2>Production Setup</h2><p>Running vLLM locally on your laptop is great for testing and development. But when you&#8217;re ready to deploy to production and serve real user traffic, you need a more robust setup. Production environments require three things that local testing doesn&#8217;t: reproducibility (it should work the same way on any machine), proper configuration (tuned for your specific workload), and monitoring (so you know when things break).</p><p>Let&#8217;s set up vLLM for production using Docker.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>Docker Setup</h3><p>I know some people groan when they hear &#8220;Docker,&#8221; but hear me out. Docker solves real problems that you&#8217;ll encounter in production.</p><p>Without Docker, you&#8217;re installing vLLM directly on your production server. That means you&#8217;re managing Python versions, CUDA drivers, system dependencies, and all the little quirks of that specific machine. If your server crashes and you need to spin up a replacement, you&#8217;re manually recreating that environment. If it works on your machine but fails in production, you&#8217;re spending hours debugging environment differences.</p><p>Docker packages everything your application needs into a container. The same container that works on your laptop will work on AWS, Google Cloud, your company&#8217;s on-prem servers, or anywhere else. You deploy once, and it runs everywhere identically.</p><p>Here&#8217;s what we&#8217;re going to build: a Docker container that includes Python, vLLM, and all dependencies. We&#8217;ll use Docker Compose to manage the container and configure GPU access. Then we&#8217;ll add environment variables so you can adjust settings without rebuilding the container.</p><h5>Create Your Dockerfile</h5><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_KYI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_KYI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 424w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 848w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_KYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png" width="1026" height="1116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1116,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183053,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_KYI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 424w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 848w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!_KYI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F070a78c8-3d88-4e58-bb14-bf998660780a_1026x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let me walk through what each section does.</p><p>We&#8217;re starting from <code>nvidia/cuda:12.1.0-runtime-ubuntu22.04</code>, which is NVIDIA&#8217;s official image that includes CUDA drivers. CUDA is what lets vLLM talk to your GPU. Without this base image, vLLM can&#8217;t access the GPU and will fall back to CPU inference (which is painfully slow for LLMs).</p><p>The first <code>RUN</code> command installs Python 3.10 and pip. We&#8217;re also running rm <code>-rf /var/lib/apt/lists/*</code> at the end to delete the package manager&#8217;s cache, which reduces the final image size by a few hundred megabytes.</p><p>The second <code>RUN</code> command installs vLLM via pip. This will take a while the first time you build the image (vLLM and its dependencies are several gigabytes), but Docker caches this layer so subsequent builds are faster.</p><p>We&#8217;re copying a startup script (which we&#8217;ll create next) into the container. The <code>chmod +x</code> command makes the script executable.</p><p><code>EXPOSE 8000 </code>documents that this container listens on port 8000. This doesn&#8217;t actually open the port (Docker Compose handles that), but it&#8217;s good documentation for anyone reading your Dockerfile.</p><p>Finally, <code>CMD</code> tells Docker what command to run when the container starts. We&#8217;re running our startup script.</p><h5>Create Your Startup Script</h5><p>Create a file called <code>start_vllm.sh</code> in the same directory.</p><pre><code>#!/bin/bash
python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.3} \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization ${GPU_MEMORY:-0.9} \
    --max-model-len ${MAX_LENGTH:-4096}</code></pre><p>This script starts vLLM with configuration passed via environment variables. The<code> ${MODEL_NAME:-mistralai/Mistral-7B-Instruct-v0.3} </code>syntax means &#8220;use the MODEL_NAME environment variable if it exists, otherwise default to mistralai/Mistral-7B-Instruct-v0.3.&#8221; This lets you change the model without rebuilding the Docker image.</p><p>The <code>--host 0.0.0.0 </code>flag tells vLLM to listen on all network interfaces. Inside a Docker container, this is necessary for external connections to reach vLLM. If you used <code>127.0.0.1 </code>(localhost), only processes inside the container could connect, which defeats the purpose.</p><h5>Create Your Docker Compose File</h5><p>Create a file called <code>docker-compose.yml</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EgVY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EgVY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 424w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 848w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 1272w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EgVY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png" width="1194" height="1452" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1452,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:221278,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EgVY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 424w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 848w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 1272w, https://substackcdn.com/image/fetch/$s_!EgVY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22cb86ba-48cc-4491-bbe0-924ac3a7a072_1194x1452.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Docker Compose lets you define multiple services (containers) and how they interact. We&#8217;re defining two services: <code>vllm</code> (the inference server) and <code>app</code> (your application that calls vLLM).</p><p>The <code>ports</code> section maps port 8000 inside the container to port 8000 on your host machine. This lets you access vLLM at <code>http://localhost:8000</code> from outside the container.</p><blockquote><p><strong>Note &#128221;:</strong> You don&#8217;t have to use port 8000. In fact, you have over 65,000 network ports to use on your machine!</p></blockquote><p>The <code>deploy.resources.reservations.devices</code> section gives the container access to your GPU. Without this, Docker won&#8217;t let the container see the GPU, and vLLM will fail to start.</p><p>The <code>volumes</code> section mounts a local directory (<code>./models</code>) into the container at <code>/root/.cache/huggingface</code>. This is where HuggingFace stores downloaded models. By mounting a local directory, the model persists between container restarts. Without this, you&#8217;d re-download 14GB every time you restart the container.</p><p>The <code>restart: unless-stopped</code> policy automatically restarts the container if it crashes. In production, you want this. If vLLM hits an out-of-memory error and crashes, the container will automatically restart and try again.</p><p>The <code>app</code> service is your application code. The <code>depends_on: vllm</code> line tells Docker Compose to start vLLM before starting your app. The <code>VLLM_URL</code> environment variable tells your app where to find vLLM. Since both services are in the same Docker Compose network, your app can access vLLM at <code>http://vllm:8000</code> (using the service name as the hostname).</p><h5>Build and Run Everything</h5><pre><code>docker-compose up -d</code></pre><p>The <code>-d</code> flag runs containers in detached mode (in the background). Without it, Docker Compose would print logs to your terminal and you&#8217;d need to keep that terminal window open.</p><p>On the first run, Docker will:</p><ol><li><p>Download the base NVIDIA CUDA image (~4GB)</p></li><li><p>Build your vLLM image (~6-12GB depending on dependencies)</p></li><li><p>Start the containers</p></li><li><p>Download the Mistral-7B model (~14GB)</p></li></ol><p>This takes 10-30 minutes depending on your internet connection. Subsequent starts are much faster because everything is cached.</p><p>You can check if vLLM is running:</p><pre><code>curl http://localhost:8000/health</code></pre><p>If you get a 200 response, vLLM is up and ready to serve requests.</p><h3>Configuration Options</h3><p>Now that vLLM is running in Docker, let&#8217;s talk about the configuration flags we&#8217;re using and what they actually do.</p><h5>GPU Memory Utilization</h5><p>The <code>--gpu-memory-utilization</code> flag controls how much of your GPU&#8217;s memory vLLM will use. It&#8217;s a float between 0.0 and 1.0, representing a percentage.</p><p>On a g5.xlarge with 24GB of GPU memory, setting this to 0.9 means vLLM will use up to 21.6GB (24GB &#215; 0.9). The remaining 2.4GB is reserved for CUDA operations, PyTorch overhead, and a safety buffer to prevent out-of-memory errors.</p><p>You might think &#8220;why not use 100% of the GPU?&#8221; Because GPUs need some memory for internal operations. If you set this to 1.0, you&#8217;ll see random OOM crashes when vLLM tries to allocate just a bit more memory than available. Setting it to 0.9 gives you a buffer.</p><p>If you&#8217;re getting OOM errors even at 0.9, lower this to 0.85 or 0.8. You&#8217;ll handle fewer concurrent requests, but the server will be more stable.</p><p>If your GPU is barely utilized (you&#8217;re seeing 40-50% memory usage in <code>nvidia-smi</code>), you can try increasing this to 0.95. More memory means more concurrent requests, which means higher throughput.</p><p>Start with 0.9. Only adjust if you&#8217;re seeing problems.</p><h5>Max Model Length</h5><p>The <code>--max-model-len</code> flag sets the maximum number of tokens (input + output) that a single request can use.</p><p>Mistral-7B technically supports up to 8192 tokens (about 6000 words). But longer contexts use more memory. If you set <code>--max-model-len 8192</code>, each request reserves enough memory for 8192 tokens, even if most requests only use 500 tokens. <strong>You&#8217;re wasting memory</strong>.</p><p>Think about your actual use case. If you&#8217;re analyzing resumes (like our example), most resumes are 500-1000 words. That&#8217;s roughly 700-1400 tokens. Setting <code>--max-model-len 2048</code> gives you plenty of headroom and doesn&#8217;t waste memory on unused capacity.</p><p>If you set this too low, requests with longer inputs will get truncated or rejected. If you set it too high, you&#8217;re using memory inefficiently and handling fewer concurrent requests.</p><p>Here&#8217;s a rule of thumb based on your use case:</p><p>&#128161; For resume analysis or short document processing, use 2048. Most business documents are under 1000 words.</p><p>&#128161; For customer support chat where context from previous messages matters, use 4096. You need room for conversation history.</p><p>&#128161; For long document summarization where you&#8217;re feeding in entire articles, use 8192. But be aware this significantly reduces your concurrent request capacity.</p><p>You can see the tradeoff: on a 24GB GPU with max-model-len=2048, you might handle 12 concurrent requests. With max-model-len=8192, you might only handle 4 concurrent requests. Choose based on your actual needs, not the maximum possible value.</p><h5>Tensor Parallelism</h5><p>The <code>--tensor-parallel-size </code>flag splits a model across multiple GPUs. This is only relevant if you&#8217;re running models that don&#8217;t fit on a single GPU.</p><p>For 7B and 13B models on a 24GB or 40GB GPU, you don&#8217;t need this. The model fits comfortably on one GPU.</p><p>For 70B models, you&#8217;d need something like <code>--tensor-parallel-size 4</code> to split the model across 4 GPUs. Each GPU holds 1/4 of the model weights.</p><p>Unless you&#8217;re running very large models (70B+), ignore this flag entirely. Setting it when you don&#8217;t need it just adds complexity and slightly reduces performance due to inter-GPU communication overhead.</p><h5>Configuration Examples for Different Use Cases</h5><p>Let me show you hypothetical configurations for different scenarios.</p><p><strong>Resume Analysis (our example)</strong></p><ul><li><p>Model: Mistral-7B</p></li><li><p>GPU: 24GB A10</p></li><li><p>Max Model Length: 2048</p></li><li><p>GPU Memory Utilization: 0.9</p></li><li><p>Expected Throughput: 20-50 requests/sec</p></li></ul><p>This configuration handles typical resumes lengths (1-2 pages) efficiently. You&#8217;re not wasting memory on unnecessarily long context windows.</p><p><strong>Customer Support Chatbot</strong></p><ul><li><p>Model: Llama-13B (better quality for complex queries)</p></li><li><p>GPU: 40GB A100</p></li><li><p>Max Model Length: 4096 (need room for conversation history)</p></li><li><p>GPU Memory Utilization: 0.85 (being conservative with a larger model)</p></li><li><p>Expected Throughput: 12-15 requests/sec</p></li></ul><p>Customer support needs higher quality responses than resume parsing, so we&#8217;re using a 13B model. The longer context window accommodates conversation history (10-15 messages back and forth).</p><p><strong>Document Summarization</strong></p><ul><li><p>Model: Mistral-7B</p></li><li><p>GPU: 24GB A10</p></li><li><p>Max Model Length: 8192 (processing long articles)</p></li><li><p>GPU Memory Utilization: 0.8 (lower because of long contexts)</p></li><li><p>Expected Throughput: 10-15 requests/sec</p></li></ul><p>Long contexts eat memory. We&#8217;re being conservative with GPU memory utilization and accepting lower throughput. If you need to summarize entire research papers or long-form articles, you need the full 8192 token context.</p><p>Notice the pattern? Smaller models, shorter contexts, and higher memory utilization give you better throughput. Larger models, longer contexts, and conservative memory settings give you better quality and stability. There&#8217;s always a tradeoff.</p><p>Configure based on what your application actually needs, not what&#8217;s theoretically possible. </p><h3>Model Selection</h3><p>Choosing the right model is probably the most important decision you&#8217;ll make when self-hosting. Get this wrong and you&#8217;re either wasting money on GPU power you don&#8217;t need, or delivering poor quality results that frustrate your users.</p><p>The good news is that for most applications, you don&#8217;t need the biggest, fanciest model. Let me walk you through your options and help you figure out what makes sense for your use case.</p><h5>Starting with 7B Models</h5><p>If this is your first time self-hosting an LLM, start with a 7B parameter model like Mistral-7B-Instruct or Gemma-7B. These models are the sweet spot for most applications.</p><p>A 7B model means the model has 7 billion parameters. Parameters are the weights and connections in the neural network that the model learned during training. More parameters generally means better quality, but also means more GPU memory, slower inference, and higher costs.</p><p>7B models fit comfortably on a 24GB GPU like the g5.xlarge we&#8217;ve been using. You&#8217;ll need at least 16GB of VRAM to run them, but 24GB gives you room for batching multiple requests and handling longer contexts. With a 7B model on a g5.xlarge, you&#8217;re looking at processing 50-100 tokens per second. That&#8217;s fast enough that users won&#8217;t notice any lag.</p><p>The quality is good for most tasks. Tasks like resume parsing, email classification, simple Q&amp;A, content generation, basic code comments, and data extraction. If you&#8217;re building something like &#8220;analyze this customer support ticket and categorize it,&#8221; a 7B model will nail it.</p><p>About 60-70% of LLM applications work perfectly fine with 7B models.</p><h5>When To Move Up to 13B Models</h5><p>Sometimes you need better reasoning or more nuanced understanding. That&#8217;s when you consider a 13B model like Llama-2-13B.</p><p>The jump from 7B to 13B isn&#8217;t just about size. The quality difference is noticeable. A 13B model can handle more complex instructions, follow multi-step reasoning better, and produce more coherent long-form content.</p><p>This is what changes when you move to 13B.</p><p>You need more GPU memory. A 13B model technically works on a 24GB GPU but you&#8217;ll be pushing the limits. For production, you want a 40GB GPU like the A100. That gives you breathing room for batching and longer contexts without hitting out-of-memory errors.</p><p>Your speed drops to about 30-50 tokens per second. That&#8217;s still pretty fast for most applications, but you&#8217;re processing roughly half as many requests per second compared to a 7B model.</p><p>When should you use a 13B model? Complex customer support where you need to understand nuance and tone. Code generation beyond simple snippets. Content that requires maintaining context over multiple paragraphs. Analysis tasks where you need the model to make inferences beyond what&#8217;s explicitly stated.</p><h5>The 70B Question</h5><p>Llama-3-70B and similar models are the best open source models available. They&#8217;re approaching GPT-3.5 quality. <em><strong>However</strong></em>, they&#8217;re rarely worth it for self-hosting.</p><p>A 70B model requires multiple GPUs. You&#8217;re looking at 4 GPUs with 40GB each, using tensor parallelism to split the model across them. That&#8217;s roughly $10,000-$23,000 per month in cloud infrastructure costs if you&#8217;re using AWS p4d instances or Google Cloud A2 instances. On-premises is even worse upfront (4x A100 40GB GPUs cost $40,000-60,000 just for the hardware, plus servers, cooling, and power). You&#8217;re processing maybe 10-20 tokens per second.</p><p>At that point, just use GPT-4 via the API. You get better quality, faster inference, no infrastructure management, and probably lower costs unless you&#8217;re doing truly massive volume (we&#8217;re talking 50+ million requests per month to break even on those infrastructure costs).</p><p>I&#8217;ve only seen 70B models make sense for self-hosting in two scenarios. First, you have extremely strict data privacy requirements and absolutely cannot send data to OpenAI, even encrypted. Second, you&#8217;re processing tens of millions of requests per month and the cost savings of self-hosting actually pencil out even with multiple A100 GPUs.</p><p>For everyone else? If you need 70B-level quality, use the API. Self-hosting is about saving money at scale, not about having the biggest model.</p><h5>A Critical Note About Instruction-Tuned Models</h5><p>Always use instruction-tuned models for production. These are models with &#8220;-Instruct&#8221; or &#8220;-Chat&#8221; in the name, like &#8220;Mistral-7B-Instruct-v0.3&#8221; or &#8220;Llama-2-7B-Chat.&#8221;</p><p>Base models (like &#8220;Mistral-7B-v0.1&#8221; without the &#8220;Instruct&#8221; suffix) are not trained to follow instructions. They&#8217;re trained to predict the next token, which means they&#8217;ll continue your prompt rather than answer your question. If you prompt a base model with &#8220;Write a professional email to schedule a meeting,&#8221; it might just generate &#8220;Write a professional email to schedule a doctor&#8217;s appointment. Write a professional email to...&#8221; and keep going.</p><p>Instruction-tuned models are fine-tuned specifically to understand and follow instructions. They know when to stop. They format their responses appropriately. They understand conversational context.</p><blockquote><p>&#128161; Base models are for researchers who want to fine-tune their own models. Instruction-tuned models are for everyone actually building applications. Use the instruction-tuned versions.</p></blockquote><h3>Benchmarking Performance</h3><p>Once you&#8217;ve got vLLM running, you need to know if it&#8217;s actually performing well.</p><p>I&#8217;ll show you how to benchmark your vLLM setup properly so you know what kind of performance to expect.</p><h5>Create a Simple Benchmark Script</h5><p>Here&#8217;s a Python script that sends 1000 concurrent requests to your vLLM server and measures throughput and latency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FjVe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FjVe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 424w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 848w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FjVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png" width="1456" height="1296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1296,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:300242,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FjVe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 424w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 848w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 1272w, https://substackcdn.com/image/fetch/$s_!FjVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf20f056-34ce-4c74-b90d-cc13c92cb51e_1548x1378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This script does something important: it tests <strong>concurrent requests</strong>, not sequential ones. In production, you&#8217;ll have multiple users hitting your API simultaneously. Sequential testing (sending one request, waiting for a response, sending another request) doesn&#8217;t reflect real-world usage.</p><p>The script creates 100 tasks and uses <code>asyncio.gather()</code> to send them all at once. This simulates 100 users hitting your endpoint simultaneously. That&#8217;s what actually stresses vLLM and shows you real performance characteristics.</p><p>Save this as<code> benchmark.py </code>and run it.</p><pre><code>python benchmark.py</code></pre><p>You&#8217;ll see output like:</p><pre><code>Processed 100 requests in 2.47 seconds
Throughput: 40.49 requests/second
Average latency: 24.7 ms/request</code></pre><h3>Troubleshooting Performance</h3><h5>Performance is Worse than Expected</h5><p>If you&#8217;re seeing less than ideal numbers, something is up.</p><p>First, check your GPU utilization. Run <code>nvidia-smi </code>in another terminal while your benchmark is running. You should see GPU utilization at 80-95%. If it&#8217;s lower, you&#8217;re not batching effectively or your requests are too small to keep the GPU busy. If it&#8217;s at 100% constantly, you&#8217;re GPU-bound and might benefit from a larger GPU or smaller model.</p><p>Second, check your memory usage. Also visible in <code>nvidia-smi</code>. If you&#8217;re close to your memory limit, you might be hitting out-of-memory errors intermittently. Try reducing <code>--max-model-len</code> to free up memory. Each reduction in max length lets you handle more concurrent requests.</p><p>Third, if you have headroom on memory (you&#8217;re only using 60-70% of available VRAM), try increasing <code>--gpu-memory-utilization </code>from 0.9 to 0.95. More available memory means more concurrent requests in the batch.</p><p>Fourth, consider whether your GPU is actually the bottleneck or if it&#8217;s something else.</p><h5>What if You Need Better Performance</h5><p>You have a few options. The easiest is to scale horizontally. Run multiple vLLM instances behind a load balancer. Each instance handles its own batch of requests. This works well because LLM inference is embarrassingly parallel; one request doesn&#8217;t depend on another.</p><p>You can also try a smaller model. Mistral-7B-Instruct might perform almost as well as a 13B model for your specific use case. Run quality tests, but don&#8217;t assume you need the bigger model.</p><p>Or upgrade your GPU. A g5.2xlarge has the same GPU as a g5.xlarge (A10) but with more system RAM and CPU, which can help with pre/post-processing. For significantly better performance, consider the A100 (40GB or 80GB versions).</p><p>The important thing is to benchmark early and often. Don&#8217;t wait until production to discover your setup can&#8217;t handle the load. Run benchmarks during development. Run them after every configuration change. Know your numbers!</p><h2>Monitoring and Maintenance</h2><p>Running vLLM in production isn&#8217;t a &#8220;set it and forget it&#8221; situation. Production systems fail. GPUs run out of memory. Network connections drop. Models get stuck in weird states. You need to know when these things happen and why they&#8217;re happening.</p><h3>Basic Health Checks</h3><p>The simplest form of monitoring is a health check endpoint. vLLM exposes one at <code>/health</code> that returns a 200 status code if everything is working. </p><p>You can test it manually:</p><pre><code>curl http://localhost:8000/health</code></pre><p>If you get back a 200 status code, vLLM is running and accepting requests. If you get a connection error or timeout, something is wrong.</p><h5>Automated Health Checks in Python</h5><p>Manual curl commands are fine for testing, but in production you want automated health checks that run continuously. Here&#8217;s a Python script that checks vLLM&#8217;s health and alerts you if something goes wrong.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8KtR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8KtR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 424w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 848w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8KtR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png" width="1456" height="1471" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1471,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:365244,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8KtR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 424w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 848w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!8KtR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a009253-1ec3-492d-89cd-ada09cfd081b_1548x1564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This script checks vLLM&#8217;s health every 30 seconds. If the health check fails, you&#8217;d want to send yourself an alert (via email, Slack, PagerDuty, or whatever alerting system your team uses). I&#8217;m just printing to the console here, but in production you&#8217;d integrate with your monitoring infrastructure.</p><p>The <code>timeout=5 </code>parameter is important. Without a timeout, the request could hang forever if vLLM is stuck. Five seconds is reasonable for a health check - if vLLM can&#8217;t respond in five seconds, something is seriously wrong.</p><h3>Key Metrics to Watch</h3><p>Health checks tell you if vLLM is running, but they don&#8217;t tell you if it&#8217;s running well. You need to monitor actual performance metrics to catch problems before they become outages.</p><h5>GPU Utilization</h5><p>This is the single most important metric to watch. GPU utilization tells you what percentage of your GPU&#8217;s compute capacity is being used.</p><p>You can check GPU utilization with <code>nvidia-smi</code>.</p><pre><code>nvidia-smi</code></pre><p>This displays a table showing GPU memory usage and utilization percentage. You&#8217;re looking for the line that says something like &#8220;GPU-Util: 87%&#8221;.</p><p>During active request processing, your GPU utilization should be 80-95%. If it&#8217;s consistently lower (say, 40-60%), you&#8217;re not batching requests effectively or your requests are too small to keep the GPU busy. You&#8217;re paying for GPU compute that you&#8217;re not using.</p><p>If your GPU utilization is at 100% constantly, you&#8217;re GPU-bound. Every request is waiting for GPU time. This isn&#8217;t necessarily bad if your latency is acceptable, but it means you can&#8217;t handle any more load without adding more GPUs.</p><p>Here&#8217;s what I typically see in production:</p><p>During normal business hours (moderate load), GPU utilization bounces between 70-90%. The GPU is busy but not maxed out. There&#8217;s headroom for traffic spikes.</p><p>During traffic spikes (like when a batch job runs and sends 1000 requests at once), GPU utilization hits 95-100% temporarily. That&#8217;s fine for short bursts.</p><p>During off-hours (minimal load), GPU utilization drops to 10-30%. The GPU is mostly idle, which is expected when there&#8217;s no traffic.</p><p>If your GPU utilization is consistently low even during peak hours, something is misconfigured. Check your batching settings and make sure requests are actually reaching vLLM.</p><h5>Memory Usage</h5><p>GPU memory usage tells you how much of your GPU&#8217;s VRAM is being used. This is different from utilization (which is compute capacity).</p><p>With <code>nvidia-smi</code>, you&#8217;ll see something like &#8220;15234MiB / 24576MiB&#8221; which means 15GB used out of 24GB available.</p><p>You want your memory usage to be high but not at the limit. If you set <code>--gpu-memory-utilization 0.9 </code>and you&#8217;re seeing 21-22GB used on a 24GB GPU, that&#8217;s perfect. You&#8217;re using the memory you allocated.</p><p>If you&#8217;re consistently hitting the memory limit (23.9GB used on a 24GB GPU with no headroom), you&#8217;re at risk of out-of-memory errors. Any request that needs slightly more memory than usual will crash vLLM. Lower your <code>--max-model-len </code>or <code>--gpu-memory-utilization </code>setting to create a buffer.</p><p>If you&#8217;re only using 12GB on a 24GB GPU, you&#8217;re being too conservative. You could handle more concurrent requests or longer contexts. <code>Try increasing --gpu-memory-utilization</code> from 0.9 to 0.95.</p><p>The worst scenario is seeing memory usage gradually climb over time. Let&#8217;s say it starts at 15GB, then after an hour it&#8217;s 18GB, then after two hours it&#8217;s 21GB, and eventually it hits 24GB and crashes. That&#8217;s a memory leak. Restart vLLM and check if there&#8217;s a known bug in your version. Memory leaks are rare in vLLM but not impossible.</p><h5>Request Queue Depth</h5><p>vLLM maintains an internal queue of requests waiting to be processed. When your GPU is busy processing a batch, new incoming requests sit in this queue.</p><p>vLLM logs the queue depth periodically. You&#8217;ll see log messages like:</p><pre><code>INFO: Queue depth: 3
INFO: Queue depth: 7
INFO: Queue depth: 12</code></pre><p>A small queue (1-5 requests) is normal and healthy. It means requests are arriving slightly faster than you can process them, but you&#8217;re keeping up.</p><p>A growing queue is a red flag. If you see the queue depth climbing from 5 to 10 to 20 to 50 over the course of a few minutes, you&#8217;re not keeping up with incoming traffic. Requests are piling up faster than you can process them.</p><p>What happens when the queue gets too large? Latency explodes. If there are 50 requests in the queue and you process 5 requests per second, the 50th request waits 10 seconds just to start processing. Users don&#8217;t wait 10 seconds - they timeout and retry, which adds even more requests to the queue. Death spiral &#128565;&#8205;&#128171;.</p><p>If you see your queue growing consistently, you need more capacity. Either scale horizontally (add more vLLM instances behind a load balancer) or vertically (move to a larger GPU that can process requests faster).</p><h5>Latency Percentiles</h5><p>Average latency is useful, but it doesn&#8217;t tell the whole story. You need to track latency percentiles to understand the full distribution.</p><p>Here&#8217;s what each percentile means.</p><p>p50 (median) is the latency that half your requests experience. If p50 is 200ms, half your requests complete in under 200ms.</p><p>p95 means 95% of requests complete under this threshold. If p95 is 500ms, 5% of requests take longer than 500ms.</p><p>p99 means 99% of requests complete under this threshold. If p99 is 1000ms, 1% of requests take longer than 1 second.</p><p>Why track percentiles instead of just average? Because outliers matter. You could have an average latency of 300ms but a p99 latency of 5 seconds. That means 1 in 100 users is having a terrible experience, even though your average looks great.</p><p>If your p99 is over 2 seconds, users are definitely noticing. Some are probably timing out. You need to investigate why the slowest 1% of requests are so slow. Possible causes: occasional long inputs that hit your <code>max-model-len </code>limit, memory pressure causing occasional slowdowns, or queue backups during traffic spikes.</p><p>To track percentiles in Python, you can use a library like <code>prometheus_client</code> or just log request latencies and calculate percentiles periodically.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BKmt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BKmt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 424w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 848w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BKmt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png" width="1262" height="1302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1302,&quot;width&quot;:1262,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:283972,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BKmt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 424w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 848w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!BKmt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6696ec79-bac8-4c68-b74f-271283700936_1262x1302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Common Failure Modes</h3><p>Here are some production failures you could come across in production.</p><h5>Out-of-memory Errors</h5><p>This is the most common failure mode. You&#8217;ll see an error message like:</p><pre><code>RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB 
(GPU 0; 23.70 GiB total capacity; 21.80 GiB already allocated)</code></pre><p>This means a request tried to allocate more memory than available. The GPU had 21.8GB already in use, tried to allocate another 2GB, but only had 1.9GB free.</p><p>There&#8217;s usually three reasons why this happens.</p><p>First, you set <code>--max-model-len </code>too high. You told vLLM that requests can use up to 8192 tokens, so it reserves enough memory to handle 8192-token requests in each batch slot. But your GPU doesn&#8217;t have enough memory for multiple 8192-token requests at once. <strong>Solution:</strong> lower <code>--max-model-len</code> based on your actual use case.</p><p>Second, you set <code>--gpu-memory-utilization</code> too high. You told vLLM to use 95% of GPU memory, which doesn&#8217;t leave enough buffer for CUDA operations and PyTorch overhead. <strong>Solution:</strong> lower it to 0.85 or 0.9.</p><p>Third, you got unlucky. Most requests are short, but occasionally someone sends a 5000-token input. That one request needs way more memory than usual and causes an OOM. <strong>Solution:</strong> set a maximum input length in your application layer before requests reach vLLM. Reject requests over your limit with a clear error message.</p><p>When vLLM hits an OOM error, it usually crashes completely. Your Docker container&#8217;s <code>restart: unless-stopped</code> policy will automatically restart it, but you&#8217;ve lost all in-flight requests. Users get errors. Not great.</p><p>The best solution is to prevent OOM errors in the first place by tuning your memory settings conservatively.</p><h5>Slow Startup and Cold Starts</h5><p>The first request to vLLM after startup takes 30-60 seconds to complete. This is because vLLM needs to load the model weights into GPU memory, initialize CUDA kernels, and warm up the inference pipeline.</p><p>Subsequent requests are fast, but that first request is painfully slow. If you&#8217;re running health checks, that first health check might fail because it times out before vLLM finishes initializing.</p><p><strong>Solution:</strong> send a warmup request after vLLM starts up. In your startup script or Docker entrypoint, add a curl command that waits for vLLM to be ready.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6vLY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6vLY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 424w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 848w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 1272w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6vLY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png" width="1362" height="1490" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1490,&quot;width&quot;:1362,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304389,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6vLY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 424w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 848w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 1272w, https://substackcdn.com/image/fetch/$s_!6vLY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb43d0222-7d23-427c-a8cf-01af379652cb_1362x1490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This script starts vLLM in the background, waits for the health endpoint to respond, sends a warmup request to load everything into memory, and then declares vLLM ready. </p><h5>Performance Degrading Over Time</h5><p>Sometimes vLLM starts out fast but gets slower over hours or days. Your p50 latency starts at 200ms, but after 12 hours it&#8217;s 400ms. After 24 hours it&#8217;s 600ms.</p><p>This usually indicates memory fragmentation or a memory leak. As vLLM processes millions of requests, the GPU memory gets fragmented (lots of small free chunks instead of large contiguous blocks). This makes it harder to allocate memory for new requests and slows things down.</p><p><strong>Solution:</strong> restart vLLM periodically.</p><p>You can automate this with a cron job or your orchestration platform (Kubernetes can do rolling restarts automatically). Just make sure you&#8217;re doing rolling restarts (restart one instance at a time) so you don&#8217;t lose all capacity at once.</p><h5>The GPU is There But vLLM Can&#8217;t See It</h5><p>Sometimes Docker doesn&#8217;t properly pass GPU access to the container. vLLM starts up but immediately errors out with:</p><pre><code>RuntimeError: No CUDA GPUs are available</code></pre><p>This happens when:</p><ul><li><p>You forgot the <code>deploy.resources.reservations.devices</code> section in docker-compose.yml</p></li><li><p>NVIDIA Docker runtime isn&#8217;t installed on the host</p></li><li><p>The NVIDIA drivers on the host are outdated or broken</p></li></ul><p>Check that NVIDIA Docker is working on your host:</p><pre><code>docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi</code></pre><p>If this command fails, Docker can&#8217;t access your GPU. You need to install the NVIDIA Container Toolkit. If it works, the problem is with your docker-compose configuration.</p><p>These are some issues I&#8217;ve hit in production. Monitor for them proactively instead of discovering them when users are complaining. Set up alerts for OOM errors, high queue depth, and latency spikes. Check your GPU utilization daily. Restart instances during off-hours to prevent performance degradation.</p><p>Production isn&#8217;t glamorous, but it&#8217;s where your self-hosted LLM either proves its worth or becomes a maintenance nightmare. Put in the monitoring work upfront and you&#8217;ll sleep better at night.</p><h2>Fallback Strategies</h2><p>Never depend on 100% uptime of any single service. It doesn&#8217;t matter how reliable that service is. It will go down at the worst possible time.</p><h3>Multi-Tier Fallback Architecture</h3><p>The idea behind multi-tier fallback is simple: have multiple ways to get an answer, ordered by preference. Try the cheapest, fastest option first. If that fails, try the next option. Keep going until you get a result or run out of options.</p><p>Here&#8217;s an example.</p><p><strong>Tier 1 is your self-hosted vLLM instance.</strong> This is your primary service. It&#8217;s fast, cheap, and handles 95% of requests under normal conditions. When everything is working correctly, all your traffic goes here. You&#8217;re paying $727/month for infrastructure and processing requests at $0.000088 each.</p><p><strong>Tier 2 is OpenAI&#8217;s API. </strong>This is your fallback when vLLM is down, overloaded, or timing out. It&#8217;s more expensive ($0.00275 per request for GPT-4o), but it&#8217;s incredibly reliable. Under normal conditions, only 4-5% of your requests should hit this tier. That&#8217;s okay. You&#8217;re paying a small premium for reliability insurance.</p><p><strong>Tier 3 is cached responses.</strong> For common queries that don&#8217;t change often, you can return cached results from previous requests. This is your emergency fallback. If both vLLM and OpenAI are unreachable (which is extremely rare), you can at least return something for common cases. Maybe 1% of requests use this tier.</p><h5>Implementing Multi-Tier Fallback</h5><p>Here&#8217;s an example on how to code this pattern.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fsqo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fsqo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 424w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 848w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 1272w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fsqo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png" width="1456" height="3372" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3372,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:943966,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fsqo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 424w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 848w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 1272w, https://substackcdn.com/image/fetch/$s_!Fsqo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9d76855-e9c4-41a2-8f0a-4501f4a89171_1640x3798.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The function first tries vLLM with a 5-second timeout. If vLLM responds within 5 seconds, great. You return the result and you&#8217;re done. That&#8217;s the happy path that happens 95% of the time.</p><p>If vLLM takes longer than 5 seconds (timeout) or throws an exception (crash, out of memory, network error), you immediately fall back to OpenAI. You&#8217;re not making the user wait while vLLM struggles. Five seconds is already too long from a user experience perspective.</p><p>If OpenAI also fails (which is rare but possible during major outages), you check your cache for a similar request. Maybe someone analyzed this exact resume yesterday and you cached the result. Better to return a slightly stale result than nothing at all.</p><p>The cache is your emergency parachute. It won&#8217;t work for unique inputs, but for common queries (like &#8220;analyze this standard software engineer resume&#8221;), you might have a cached response. It&#8217;s better than showing users an error page.</p><h3>The Circuit Breaker Pattern</h3><p>Fallback tiers handle individual request failures, but what happens when vLLM is completely dead? You don&#8217;t want to keep trying it for every request, waiting 5 seconds for a timeout each time, then falling back to OpenAI. That&#8217;s 5 seconds of unnecessary delay per request.</p><p>This is where the circuit breaker pattern comes in. It&#8217;s like the circuit breaker in your house&#8217;s electrical panel. When there&#8217;s a problem (too much current), the breaker trips and stops trying to send electricity until you manually reset it. In software, a circuit breaker does the same thing: it stops trying a failing service and automatically recovers when the service is healthy again.</p><p>Here&#8217;s how it works:</p><p>When vLLM is healthy, the circuit breaker is &#8220;closed&#8221; (electricity flows). Requests go to vLLM normally.</p><p>When vLLM starts failing repeatedly (say, 5 failures in a row), the circuit breaker &#8220;opens&#8221; (electricity stops). Now all requests immediately skip vLLM and go straight to OpenAI. You&#8217;re not wasting time trying a service you know is down.</p><p>After a timeout period (say, 60 seconds), the circuit breaker goes into &#8220;half-open&#8221; state. It tries one request to vLLM to see if it&#8217;s recovered. If that request succeeds, the circuit closes and traffic resumes normally. If it fails, the circuit stays open for another 60 seconds.</p><p>This prevents cascading failures. Without a circuit breaker, you&#8217;d keep hammering vLLM with requests while it&#8217;s struggling, making the problem worse. With a circuit breaker, you give vLLM time to recover while your users get fast responses from OpenAI.</p><h5>Implementing a Circuit Breaker</h5><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yjm0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yjm0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 424w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 848w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 1272w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yjm0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png" width="1456" height="2445" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2445,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:684126,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yjm0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 424w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 848w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 1272w, https://substackcdn.com/image/fetch/$s_!Yjm0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76c45765-1964-4c89-a6be-d03b161f292d_1640x2754.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The circuit breaker prevents you from wasting resources on a service that&#8217;s clearly broken. If vLLM fails 5 times in a row, something is seriously wrong. Stop trying. Give it a minute to recover. Your users get fast responses from OpenAI instead of waiting 5 seconds per request for vLLM to timeout.</p><h3>Monitoring Fallback Usage</h3><p>Once you&#8217;ve implemented fallbacks, you need to monitor which tier is serving requests. If 50% of your traffic is hitting OpenAI instead of vLLM, something is wrong with your vLLM setup. You&#8217;re paying way more than you budgeted for.</p><p>Track fallback metrics using Prometheus (or whatever monitoring system you use).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gb1t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gb1t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 424w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 848w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 1272w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gb1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png" width="1456" height="1124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1124,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:296371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gb1t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 424w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 848w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 1272w, https://substackcdn.com/image/fetch/$s_!Gb1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff79a0be8-5d21-41bd-b64a-c66384967980_1640x1266.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h5>Set Up Alerts for Abnormal Fallback Rates</h5><p>If more than 10% of requests are using OpenAI fallback, something is wrong with vLLM. Maybe it&#8217;s hitting OOM errors, maybe your GPU is overloaded, maybe there&#8217;s a network issue. Investigate immediately before your costs spiral.</p><p>If any requests hit cache fallback, notify someone immediately. Cache fallback means both vLLM and OpenAI failed. That&#8217;s a critical outage. Users are getting degraded service (stale cached responses or errors).</p><p>I monitor these metrics in Grafana with alert rules.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E5OD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E5OD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 424w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 848w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 1272w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E5OD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png" width="1456" height="692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176052,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/179015509?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E5OD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 424w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 848w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 1272w, https://substackcdn.com/image/fetch/$s_!E5OD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d211ca4-3106-4286-8887-fb0a63b46898_1566x744.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The first alert fires if more than 10% of requests hit OpenAI for more than 5 minutes. That&#8217;s abnormal and indicates a problem with vLLM.</p><p>The second alert fires immediately if any request hits cache in the last 5 minutes. That&#8217;s critical because it means both vLLM and OpenAI failed.</p><h5>Understanding Your Fallback Costs</h5><p>Let&#8217;s do some math to understand the cost impact of fallbacks.</p><p>Assume you&#8217;re processing 1 million requests per month. Under normal conditions:</p><ul><li><p>950,000 requests go to vLLM at $0.000088 each = $83.60</p></li><li><p>50,000 requests fall back to OpenAI at $0.00275 each = $137.50</p></li><li><p>Total: $221.10 per month</p></li></ul><p>That&#8217;s still way cheaper than running everything through OpenAI (which would cost $2,750 for 1 million requests).</p><p>But what if vLLM is down 10% of the time?</p><ul><li><p>900,000 requests go to vLLM at $0.000088 each = $79.20</p></li><li><p>100,000 requests fall back to OpenAI at $0.00275 each = $275</p></li><li><p>Total: $354.20 per month</p></li></ul><p>Your costs went up by 60%, but you&#8217;re still operational. Without fallbacks, you&#8217;d have 10% downtime and angry users. The extra $133 per month is cheap insurance.</p><p>This is why monitoring fallback rates matters. If your fallback rate is creeping up, you&#8217;re paying more than you expected. Fix vLLM before costs become a problem.</p><h3>Final Thoughts on Resilience</h3><p>Building resilient systems isn&#8217;t glamorous. It&#8217;s extra work upfront to handle edge cases that might never happen. But when they do happen, you&#8217;ll be glad you invested the time.</p><p>Your users don&#8217;t care if you&#8217;re saving 97% on infrastructure costs by self-hosting. They care that your application works when they need it. Multi-tier fallbacks, circuit breakers, and monitoring give you both: cost savings during normal operation and reliability during failures.</p><p>That&#8217;s the whole point of self-hosting with proper fallbacks. You get the economics of running your own infrastructure with the reliability of managed services.</p><h2>Wrapping Up</h2><p>We&#8217;ve covered a lot of ground in this four-part series. If you&#8217;ve made it this far, you now know more about self-hosting LLMs than most engineers working with AI today. That&#8217;s not hyperbole. Most teams jump straight to OpenAI&#8217;s API without ever questioning whether there&#8217;s a better way.</p><p>Let me recap what we&#8217;ve built and why it matters.</p><h4>The Economics Make sense at Scale</h4><p>In Part 1, we did the math. Self-hosting becomes economical above roughly 264,000 requests per month. Below that threshold, stick with the API. Above it, you&#8217;re potentially wasting thousands of dollars per month by not self-hosting.</p><p>However, the exact breakeven point doesn&#8217;t matter as much as understanding your usage patterns. If you&#8217;re processing 200,000 requests per month now but growing 20% month over month, you&#8217;ll hit the breakeven point in two months. Start planning your self-hosting infrastructure now, not when you&#8217;re already bleeding money on API costs.</p><h4>vLLM is Genuinely Impressive Technology</h4><p>The efficiency gains from vLLM aren&#8217;t marketing hype. PagedAttention, continuous batching, and optimized CUDA kernels combine to give you 2-4x better throughput compared to naive PyTorch implementations.</p><p>The technology works. Trust the benchmarks, but verify them with your own workload.</p><h4>Production is Where Theory Meets Reality</h4><p>Everything we covered in the <strong>Monitoring and Maintenance</strong> section about Docker, monitoring, and fallbacks might seem like overkill when you&#8217;re just testing vLLM locally. It&#8217;s not. Production is where all your assumptions get stress-tested by real users with real deadlines.</p><p>Build resilience from day one. Set up monitoring before you launch. Implement fallbacks before you need them. </p><h4>BAML Abstractions Are the Secret Weapon</h4><p>The ability to swap between self-hosted vLLM and OpenAI&#8217;s API with just a configuration change is not just convenient. It&#8217;s strategically important.</p><p>Without BAML, migrating from OpenAI to vLLM would mean rewriting your entire application. Changing every API call, updating response parsing logic, handling different error formats, rewriting tests. That&#8217;s weeks of engineering work and substantial risk.</p><p>With BAML, it&#8217;s a one-line change: <code>client GPT4</code> becomes <code>client LocalVLLM</code>. Everything else stays the same. Your prompts don&#8217;t change. Your type definitions don&#8217;t change. Your application logic doesn&#8217;t change.</p><p>This gives you flexibility. You can test vLLM during development, use OpenAI for staging, and run different models for different functions in production. You can implement fallback logic where cheap requests go to vLLM and complex requests go to GPT-4. You can experiment without committing to a complete rewrite.</p><p>That abstraction work we did in Part 2 pays for itself many times over. It&#8217;s the difference between being locked into a vendor and having options.</p><h4>Key Principles Worth Remembering</h4><p>Throughout this series, a few themes kept coming up. These are worth internalizing:</p><p>Always do the cost analysis for your specific situation. The numbers I provided are examples, not gospel. Your infrastructure costs might be different. Your request volume might follow different patterns. Your quality requirements might justify different model choices. Plug in your actual numbers and make decisions based on your reality, not my examples.</p><p>Build fallback strategies before you need them. Every production system fails eventually. Your self-hosted infrastructure will go down. OpenAI&#8217;s API will have an outage. Your circuit breaker will trip. When these things happen, having fallbacks means your users never notice. Not having fallbacks means you&#8217;re scrambling to fix things while your application is broken.</p><p>Monitor everything you care about. You can&#8217;t fix what you can&#8217;t see. Track your GPU utilization, memory usage, request queue depth, and latency percentiles. Set up alerts before problems become outages. I&#8217;ve spent too many hours debugging issues that would have been obvious if I&#8217;d just looked at the right metrics earlier.</p><p>Start simple and graduate to complexity. Begin with OpenAI&#8217;s API. When you hit the economic threshold, move to self-hosting. Start with a 7B model before jumping to 70B. Run one vLLM instance before implementing multi-instance load balancing. Every layer of complexity you add creates new failure modes. Add complexity only when you have a specific problem to solve.</p><h4>Where to Go From Here</h4><p>If you&#8217;re just starting your LLM journey, stick with OpenAI&#8217;s API for now. Focus on building your application and understanding your users&#8217; needs. Once you&#8217;re processing hundreds of thousands of requests per month, come back to this series and start planning your self-hosting migration.</p><p>If you&#8217;re already at scale and paying too much for API calls, start with a cost analysis. Figure out your breakeven point. Then work through Part 2 to set up BAML abstractions. Finally, use the <strong>Monitoring and Maintenance</strong> section as a deployment guide. Take it step by step.</p><p>If you&#8217;ve been running self-hosted LLMs but struggling with reliability, focus on monitoring and fallback strategies. Resilience engineering isn&#8217;t glamorous, but it&#8217;s what separates hobby projects from production systems.</p><h4>My Final Thoughts</h4><p>Writing this series taught me that self-hosting LLMs is no longer a bleeding-edge experiment for companies with unlimited engineering resources. The tooling has matured. vLLM is production-ready. The economics make sense at reasonable scale.</p><p>But it&#8217;s also not a silver bullet. Self-hosting introduces operational complexity. You&#8217;re trading API costs for infrastructure management, monitoring, and on-call responsibilities. That tradeoff makes sense for some teams and not for others.</p><p>The key is knowing when you&#8217;re on which side of that line. If you&#8217;re processing 50,000 requests per month, stay with the API. If you&#8217;re processing 5 million requests per month and haven&#8217;t considered self-hosting, you&#8217;re probably overpaying by tens of thousands of dollars.</p><p>Do the math. Build the abstractions. Plan for failure. Monitor everything. And don&#8217;t be afraid to experiment. The worst that happens is you learn something new about your system.</p><p>Good luck with your self-hosting journey.</p><p>Happy savings &#128176;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Testing Your LLM Applications (Without Going Broke)]]></title><description><![CDATA[Traditional software is deterministic. Same input, same output, every time. LLMs? Non-deterministic by design. Same input, different output. Even with temperature=0, you get variations. However, there are ways to successfully test your code and in this post you'll learn them.]]></description><link>https://mlpivot.dev/p/testing-your-llm-applications</link><guid isPermaLink="false">https://mlpivot.dev/p/testing-your-llm-applications</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Fri, 21 Nov 2025 23:57:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/191cceb8-c5fe-4489-a04d-9443f0972265_2048x2048.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks into my LLM project, I needed to write tests for a resume parsing function I was building. Not the &#8220;run it manually and hope all is well&#8221; kind of testing - actual automated tests that could run in CI/CD.</p><p>The function called OpenAI&#8217;s API. Which costs money &#129297;. And returns different outputs each time.</p><p>My first attempt? I just called the API directly in my tests. Wrote five test cases, ran pytest, watched my terminal fill with API calls. Tests took 45 seconds. Then I checked my OpenAI dashboard: $0.50 gone.</p><p>That adds up fast. Run tests 20 times a day during development? $10/day. $50/week. $200/month. That&#8217;s just me. Scale to a team of five? $1,000/month on test runs.</p><p>There&#8217;s a better way.</p><h2>What We&#8217;re Fixing</h2><p>Quick recap from Part 1: one of our problems was that we couldn&#8217;t test LLM code without literally calling the API. That made our tests expensive, slow, and flaky.</p><p>Now that we&#8217;ve got structured prompts and type safety from Part 2, we can build a proper testing strategy. Today we&#8217;ll:</p><ul><li><p>Learn how to test LLM code without calling APIs (mocking and test doubles).</p></li><li><p>Understand the difference between unit tests and integration tests (and when you need each).</p></li><li><p>Test that your prompts truly work (evaluation suites).</p></li><li><p>Set up test infrastructure that doesn&#8217;t suck.</p></li></ul><p>Most of what I&#8217;m covering works whether you use BAML or not. These are software engineering principles that apply to any LLM application. BAML just makes some of it easier.</p><h2>Why Testing LLMs is Different</h2><p>Traditional software is deterministic. Same input, (hopefully) same output, every time.</p><pre><code>def add(a, b):
    return a + b

# This passes every single time
assert add(2, 3) == 5</code></pre><p>LLMs? Non-deterministic by design. Same input, different output. Even with <code>temperature=0</code>, you get variations.</p><p>And it gets weirder: with traditional software, you test your code. With LLMs, you&#8217;re also testing your prompts. Is the bug in your code or your prompt? Sometimes you can&#8217;t tell.</p><p>I&#8217;ve seen tests that pass 8 times out of 10. The other 2? The LLM formats the response slightly differently and the parsing breaks. That&#8217;s not a code bug. That&#8217;s a prompt engineering problem disguised as a testing problem.</p><p>So we need a testing strategy that handles both: testing our code works correctly AND testing our prompts produce useful outputs.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Universal Testing Principles</h2><p>I&#8217;ll show you the core testing principles first, using plain Python. This works whether you&#8217;re using BAML, LangChain, raw OpenAI calls, or building your own abstractions.</p><h3>Don&#8217;t Test the LLM, Test Your Code</h3><p>You don&#8217;t need to test the LLM itself. OpenAI already tested GPT-4. You need to test YOUR code that wraps the LLM.</p><p>Here&#8217;s our original resume analyzer (simplified) &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YrAp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YrAp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 424w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 848w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 1272w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YrAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png" width="1228" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:157298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YrAp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 424w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 848w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 1272w, https://substackcdn.com/image/fetch/$s_!YrAp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc615dba9-3386-41fe-9d6f-71a6db32f377_1228x818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The problem: every time you test this, you hit the OpenAI API. Expensive and slow.</p><p>The solution: dependency injection. Make the LLM client something you can swap out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nLIi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nLIi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 424w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 848w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 1272w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nLIi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png" width="1456" height="1979" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1979,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:465103,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nLIi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 424w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 848w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 1272w, https://substackcdn.com/image/fetch/$s_!nLIi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F167ce525-0940-4ed0-97c0-71c093480668_1616x2196.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now in your tests, pass in a fake client.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jg-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jg-P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 424w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 848w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jg-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png" width="1456" height="1073" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1073,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:227333,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jg-P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 424w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 848w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!jg-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa83c2ecf-1a5f-46e8-9d51-fae84a79661f_1514x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>No API calls. No cost. Runs in milliseconds.</p><p>This is the fundamental pattern: separate the &#8220;call the LLM&#8221; part from the &#8220;process the result&#8221; part.</p><h3>Build Different Mocks for Different Scenarios</h3><p>One mock isn&#8217;t enough. You need mocks for different scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K3Vw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K3Vw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 424w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 848w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 1272w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K3Vw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png" width="1456" height="1270" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1270,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K3Vw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 424w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 848w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 1272w, https://substackcdn.com/image/fetch/$s_!K3Vw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F158c79f9-56ec-480a-938e-618eefddfb3a_2048x1786.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now you can test all the edge cases.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hXCx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hXCx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 424w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 848w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hXCx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png" width="1346" height="1154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1346,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:273284,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hXCx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 424w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 848w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!hXCx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42696d3b-619f-4204-9ffb-d2ed50219c0b_1346x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This builds confidence that your error handling actually works, without spending money on API calls.</p><blockquote><p><strong>&#128221; Note:</strong> Testing error handling is especially important for LLM applications because LLMs can return unexpected formats. If your error handling isn&#8217;t tested, you won&#8217;t discover the bugs until production (been there, learned that lesson the hard way).</p></blockquote><h3>Property-based Testing</h3><p>Sometimes you don&#8217;t care about the exact output. You care about properties of the output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QgIR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QgIR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 424w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 848w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 1272w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QgIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png" width="1446" height="1340" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1340,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308091,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QgIR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 424w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 848w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 1272w, https://substackcdn.com/image/fetch/$s_!QgIR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c9314b-a9f2-4d6d-b794-73a1039611d5_1446x1340.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This type of testing is about invariants: things that should ALWAYS be true. It&#8217;s less brittle than testing exact outputs, which is perfect for LLM applications where exact outputs vary.</p><h2>Integration Tests vs Unit Tests</h2><p>Mocks are great. But at some point, you need to test that your code works with the real API.</p><p><strong>Unit tests:</strong> Fast, cheap, test your code in isolation using mocks. Run these on every commit.</p><p><strong>Integration tests:</strong> Slow, expensive, test that your code actually works with real APIs. Run these less frequently (once a day or before releases).</p><p>Here&#8217;s how I&#8217;d structure these tests &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!56pl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!56pl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 424w, https://substackcdn.com/image/fetch/$s_!56pl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 848w, https://substackcdn.com/image/fetch/$s_!56pl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!56pl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!56pl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png" width="1178" height="1414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:306298,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!56pl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 424w, https://substackcdn.com/image/fetch/$s_!56pl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 848w, https://substackcdn.com/image/fetch/$s_!56pl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!56pl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9bb22caa-7fdb-4212-9717-ec5758d07317_1178x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Running Tests Locally</h4><p>During normal development, you want fast feedback. Run just the unit tests:</p><pre><code>pytest</code></pre><p>This runs in seconds, costs nothing, and catches most bugs. You&#8217;ll run this dozens of times a day as you write code.</p><p>When you&#8217;re about to commit or merge a feature, verify the integration works.</p><pre><code>RUN_INTEGRATION_TESTS=1 pytest</code></pre><p>This hits the OpenAI API. Takes a minute or two. With 5-10 integration tests, you&#8217;re looking at $0.10-0.50 per run depending on your prompt complexity and which model you&#8217;re using. You run this maybe once or twice before pushing your code.</p><p>You can also run just the integration tests.</p><pre><code>RUN_INTEGRATION_TESTS=1 pytest -m integration</code></pre><p>This is useful when you&#8217;ve changed your prompt and want to verify it works with the real API, but don&#8217;t want to wait for all the unit tests to run again.</p><h4>Setting up CI/CD</h4><p>In CI/CD, run unit tests on every PR, but integration tests only on the main branch or before releases.</p><p>This way you&#8217;re not burning money on API calls during development, but you still have confidence that the integration works.</p><h2>Evaluation Suites for Prompt Quality</h2><p>You need to test your prompts separately from your code.</p><p>A prompt evaluation suite is a collection of real examples with known correct outputs. Like a test dataset, but for your prompts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_0R5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_0R5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 424w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 848w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_0R5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png" width="1430" height="1414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:243355,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_0R5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 424w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 848w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!_0R5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60ae9cde-70f0-4726-81e6-caccd7d87094_1430x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Test your prompt against all of them.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WO1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WO1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 424w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 848w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 1272w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WO1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png" width="1456" height="891" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:891,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248835,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WO1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 424w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 848w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 1272w, https://substackcdn.com/image/fetch/$s_!WO1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b3d960-c00b-40f8-9245-f03e9f52235b_1582x968.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When you change your prompt, run this evaluation suite. If your pass rate goes down, you made the prompt worse. If it goes up, you made it better. Iterate on prompts scientifically instead of guessing.</p><blockquote><p><strong>&#128161; Pro Tip:</strong> Start your evaluation suite small (5-10 cases) and add to it over time. Don&#8217;t try to create a comprehensive eval suite on day one.</p></blockquote><h2>Regression Testing</h2><p>Every time you find a bug in production, add it to your test suite.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NmVV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NmVV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 424w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 848w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NmVV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png" width="1456" height="1123" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1123,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NmVV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 424w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 848w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!NmVV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cd8e2ff-5e98-4806-9bc9-3e137ab5df77_1834x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These tests document what went wrong and ensure it doesn&#8217;t happen again. Plus, they&#8217;re documentation for your team about quirks in the LLM&#8217;s behavior.</p><h2>Setting Up Test Infrastructure</h2><h4>Directory Structure</h4><pre><code>my-llm-project/
&#9500;&#9472;&#9472; baml_src/
&#9474;   &#9492;&#9472;&#9472; resume.baml
&#9500;&#9472;&#9472; src/
&#9474;   &#9492;&#9472;&#9472; resume_analyzer.py
&#9500;&#9472;&#9472; tests/
&#9474;   &#9500;&#9472;&#9472; unit/
&#9474;   &#9474;   &#9500;&#9472;&#9472; test_resume_analyzer.py
&#9474;   &#9474;   &#9492;&#9472;&#9472; mocks.py
&#9474;   &#9500;&#9472;&#9472; integration/
&#9474;   &#9474;   &#9492;&#9472;&#9472; test_resume_analyzer_integration.py
&#9474;   &#9492;&#9472;&#9472; evaluation/
&#9474;       &#9500;&#9472;&#9472; test_prompt_quality.py
&#9474;       &#9492;&#9472;&#9472; evaluation_suite.py
&#9500;&#9472;&#9472; pytest.ini
&#9492;&#9472;&#9472; requirements-dev.txt
</code></pre><h4>pytest.ini</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-7li!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-7li!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 424w, https://substackcdn.com/image/fetch/$s_!-7li!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 848w, https://substackcdn.com/image/fetch/$s_!-7li!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 1272w, https://substackcdn.com/image/fetch/$s_!-7li!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-7li!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png" width="1178" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbbc4011-902d-4484-9e67-b364271f30db_1178x670.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1178,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118107,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-7li!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 424w, https://substackcdn.com/image/fetch/$s_!-7li!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 848w, https://substackcdn.com/image/fetch/$s_!-7li!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 1272w, https://substackcdn.com/image/fetch/$s_!-7li!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbbc4011-902d-4484-9e67-b364271f30db_1178x670.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Running Different Test Suites</h4><pre><code># Default: only unit tests
pytest

# Include integration tests
pytest -m &#8220;unit or integration&#8221;

# Run full evaluation suite
pytest -m evaluation

# Run everything
pytest -m &#8220;&#8221;</code></pre><h4>In CI/CD (GitHub Actions)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XQsV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XQsV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 424w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 848w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XQsV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png" width="1228" height="1154" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1154,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XQsV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 424w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 848w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 1272w, https://substackcdn.com/image/fetch/$s_!XQsV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8f7b3ff-f869-4fd9-b1ad-ec551ed90af3_1228x1154.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This setup gives you fast feedback during development (unit tests) while still having confidence that integration works (integration tests), without burning through your API budget. (No more wasting $5-10 every time someone pushes a typo fix &#128293;&#128176;).</p><h2>What Good Looks Like</h2><p>My rule of thumb for a well-tested LLM application is:</p><ul><li><p><strong>80%+ unit test coverage</strong> - Test your code logic with mocks.</p></li><li><p><strong>5-10 integration tests</strong> - Verify real API calls work for critical paths.</p></li><li><p><strong>10-20 evaluation cases</strong> - Test prompt quality against real examples.</p></li><li><p><strong>Regression tests for every production bug</strong> - Ensure bugs stay fixed.</p></li></ul><p>Unit tests run in seconds and cost nothing. Integration tests run in minutes and cost a few cents. Evaluation tests cost a bit more but give you confidence in your prompts.</p><p>The mock-based approach is both cheaper AND more thorough than running all tests against the real API.</p><h2>Your Homework</h2><p>Pick that function you converted to BAML in Part 2 (or any LLM function you have). Write:</p><ol><li><p><strong>One happy path unit test</strong> - Mock a perfect response, verify your code handles it.</p></li><li><p><strong>One error handling unit test</strong> - Mock a bad response, verify your error handling works.</p></li><li><p><strong>One property-based unit test</strong> - Test that certain properties are always true.</p></li><li><p><strong>One integration test</strong> - Call the API with a real example.</p></li></ol><p>If you don&#8217;t have a function yet, use this example.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!77EE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!77EE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 424w, https://substackcdn.com/image/fetch/$s_!77EE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 848w, https://substackcdn.com/image/fetch/$s_!77EE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 1272w, https://substackcdn.com/image/fetch/$s_!77EE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!77EE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png" width="1456" height="437" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135211,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178934180?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!77EE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 424w, https://substackcdn.com/image/fetch/$s_!77EE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 848w, https://substackcdn.com/image/fetch/$s_!77EE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 1272w, https://substackcdn.com/image/fetch/$s_!77EE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc894d3d8-d267-4eb5-8680-d7d9392c16e6_1734x520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Write those four tests. I mean it! The muscle memory matters.</p><blockquote><p><strong>&#11088;&#65039; Bonus</strong>: Set up pytest markers so you can run your unit tests separately from integration tests. This will save you time and money as your test suite grows.</p></blockquote><p>Next post: Part 4 on self-hosting LLMs with vLLM. Because once you&#8217;ve got type-safe prompts (Part 2) and comprehensive tests (Part 3), you might start thinking &#8220;what if I didn&#8217;t have to pay OpenAI for every single request?&#8221;</p><p>See you there &#129406;</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Stop Guessing What Your LLM Returns]]></title><description><![CDATA[If you don't know what datatypes your LLM function returns, you're basically playing Russian roulette during runtime &#128552;]]></description><link>https://mlpivot.dev/p/stop-guessing-what-your-llm-returns</link><guid isPermaLink="false">https://mlpivot.dev/p/stop-guessing-what-your-llm-returns</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Tue, 18 Nov 2025 13:43:32 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c63c27d3-67de-42dd-94ee-923287c6d005_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A week after that sprint planning meeting where I couldn&#8217;t estimate my LLM work, I thought I&#8217;d gotten my act together. I&#8217;d refactored the prompt string into a function, moved the API key to an environment variable, and felt pretty accomplished. Then I ran the code on a new batch of resumes.</p><p>It crashed.</p><p>My downstream processing code expected <code>years_of_experience</code> to be an integer. The LLM returned &#8220;5-7 years&#8221; for someone with a range on their resume. My code that did <code>if years_of_experience &gt; 3</code> threw a <code>TypeError</code> and took down the whole pipeline.</p><p>This wasn&#8217;t an LLM problem. My code had no way to enforce what the LLM should return. I was hoping it would always give me an integer. When it didn&#8217;t, I had no safety net.</p><p>That search for a better approach led me to BAML.</p><h2>The Two Problems We&#8217;re Fixing</h2><p>From Part 1, we&#8217;re tackling two specific issues &#128071;&#127998;.</p><p><strong>No type safety.</strong> We don&#8217;t know what our LLM functions return. Is it a dict? What keys? What types? Your IDE can&#8217;t help you, and you won&#8217;t find out until runtime.</p><p><strong>String manipulation chaos.</strong> Our prompts are scattered f-strings buried in functions. We can&#8217;t version them, can&#8217;t reuse them easily, can&#8217;t A/B test them methodically.</p><p>I&#8217;m showing you BAML because it solves these problems elegantly. But the principles (type safety, separation of concerns, treating prompts as first-class citizens) matter more than any specific tool. If BAML disappeared tomorrow, you could apply these same principles with other tools.</p><h2>What BAML Does</h2><p>BAML comes from Boundary ML. It stands for <em>Basically a Made-up Language.</em></p><p>Here&#8217;s how BAML works: you define your prompts and their expected output schemas in a separate <code>.baml</code> file. BAML generates type-safe Python (or TypeScript) code. Your IDE knows what types you&#8217;re working with. You get autocomplete. You get actual contracts that the LLM needs to fulfill.</p><p>The alternative? What we did in Part 1. Hope that your string manipulation produces valid prompts and that the LLM returns something usable.</p><p>When I first heard about BAML, I thought &#8220;do we really need a new language just for prompts?&#8221;</p><p>But it&#8217;s not about the language syntax. It&#8217;s about having a place where prompts live separately from your application code. When your prompts are scattered across ten different Python files as f-strings, nobody knows which version runs in production. BAML forces you to put them all in one place, give them types, and version them properly.</p><p>Could you do this without BAML? Sure. But you probably won&#8217;t because it&#8217;s annoying to set up from scratch.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Installing BAML</h2><p>Assuming you&#8217;ve got Python 3.8 or higher, run <code>pip install baml-py</code>.</p><p>If you use VSCode, grab the BAML extension from the marketplace. The syntax highlighting alone is worth it, but you also get inline validation that catches errors before you run the code.</p><blockquote><p><strong>&#128221; Note:</strong> The VSCode extension makes a huge difference in your workflow. If you&#8217;re not a VSCode person, BAML will still work fine. You just won&#8217;t get those annoying (but important) red squiggly lines when you mess up the syntax.</p></blockquote><h3>Converting Our Messy Notebook to BAML</h3><p>Remember the janky resume analyzer from Part 1?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gDyM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gDyM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 424w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 848w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gDyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png" width="1456" height="1476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gDyM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 424w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 848w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!gDyM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe728fa75-d60e-4584-b55e-30b81cbf945b_1616x1638.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The biggest issue: we&#8217;re treating the prompt like a throwaway string, and we don&#8217;t know what structure that JSON will have when it comes back.</p><h3>Creating a BAML File</h3><p>First, set up your project structure.</p><pre><code>my-llm-project/
&#9500;&#9472;&#9472; baml_src/
&#9474;   &#9492;&#9472;&#9472; resume.baml
&#9500;&#9472;&#9472; main.py
&#9500;&#9472;&#9472; .env
&#9492;&#9472;&#9472; requirements.txt</code></pre><p>Inside <code>baml_src/resume.baml</code>, define your types and prompt function.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zA7r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zA7r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 424w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 848w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zA7r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png" width="1456" height="1033" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1033,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zA7r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 424w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 848w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 1272w, https://substackcdn.com/image/fetch/$s_!zA7r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9790efd-a012-405f-80ee-e66169fdc98c_1590x1128.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We defined a <code>class ResumeAnalysis</code>. This is our schema and our contract. When this function runs, it will return an object with these fields and these types. Not &#8220;hopefully returns&#8221; or &#8220;usually returns&#8221;. Will return.</p><p>Notice I added a clarification in the prompt: &#8220;if there&#8217;s a range, use the minimum.&#8221; Remember my crash? This prompt clarification prevents that crash. We&#8217;re explicit about what we expect.</p><p>The <code>{{ ctx.output_format }}</code> at the end is BAML magic. It automatically generates the JSON schema instructions for the LLM based on your class definition. You don&#8217;t have to manually write &#8220;return this as JSON with these exact fields.&#8221; BAML handles that for you.</p><blockquote><p>&#9888;&#65039; The prompt syntax uses <code>{{ variable_name }}</code> for variable interpolation, not Python&#8217;s f-string style.</p></blockquote><h3>Configuring Your LLM Client</h3><p>Before our function works, we need to tell BAML how to call GPT-4. Create another file in <code>baml_src/</code> called <code>clients.baml</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aSSa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aSSa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 424w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 848w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 1272w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aSSa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png" width="968" height="416" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/413766c8-0155-4102-a144-3098d917499c_968x416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:968,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56302,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aSSa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 424w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 848w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 1272w, https://substackcdn.com/image/fetch/$s_!aSSa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F413766c8-0155-4102-a144-3098d917499c_968x416.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We separated the configuration (which model, which API key, what temperature) from the function definition. This is separation of concerns in action. If you want to swap GPT-4 for GPT-5, or switch to a different provider entirely, you can change it in one place. Your function definitions don&#8217;t need to know or care about which model runs.</p><h3>Generating the Python Code</h3><p>Run the BAML compiler.</p><pre><code><code>baml-cli generate</code></code></pre><p>This creates a <code>baml_client/</code> directory with generated Python code. Don&#8217;t edit these files manually (they&#8217;ll get overwritten next time you run the generator), but do check them into version control.</p><h3>Using It in Your Python Code</h3><p>Open up <code>main.py</code>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gc91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gc91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 424w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 848w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 1272w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gc91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png" width="1456" height="1305" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1305,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:415151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gc91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 424w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 848w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 1272w, https://substackcdn.com/image/fetch/$s_!Gc91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63c123aa-baaf-4038-bdea-eab5a44c4b2a_1952x1750.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Look at what we just did.</p><p><code>analysis.candidate_name</code> - Your IDE knows this exists.</p><p><code>analysis.years_of_experience </code>- Your IDE knows this is an int.</p><p><code>analysis.skills</code> - Your IDE knows this is a list of strings.</p><p>If you try to access a field that doesn&#8217;t exist, your IDE yells at you before you run the code. If you try to do math on <code>candidate_name</code>, your type checker catches it.</p><p>This is type safety.</p><h2>Handling Complex Nested Structures</h2><p>The resume analysis returns flat data: a name, a number, a list of strings. Most real-world LLM tasks aren&#8217;t that simple.</p><p>Say you&#8217;re analyzing meeting transcripts. You need action items, but each action item has its own structure: a task, who it&#8217;s assigned to, a due date, a priority level. That&#8217;s nested data.</p><p>Here&#8217;s how you handle it in BAML. First, define the nested structure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S7or!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S7or!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 424w, https://substackcdn.com/image/fetch/$s_!S7or!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 848w, https://substackcdn.com/image/fetch/$s_!S7or!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 1272w, https://substackcdn.com/image/fetch/$s_!S7or!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S7or!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png" width="1324" height="532" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70665,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S7or!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 424w, https://substackcdn.com/image/fetch/$s_!S7or!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 848w, https://substackcdn.com/image/fetch/$s_!S7or!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 1272w, https://substackcdn.com/image/fetch/$s_!S7or!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29985bdb-3a00-4fa5-819e-e1ba3ab7da5f_1324x532.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each action item is its own object with typed fields. The <code>Priority</code> enum ensures the LLM can only return those four values, not random strings like &#8220;kinda important".</p><p>Now define what a full meeting analysis looks like.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hc7X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hc7X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 424w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 848w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 1272w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hc7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png" width="1238" height="314" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47a937b7-007c-46c4-b368-e270745468be_1238x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:1238,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64350,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hc7X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 424w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 848w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 1272w, https://substackcdn.com/image/fetch/$s_!hc7X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47a937b7-007c-46c4-b368-e270745468be_1238x314.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The <code>action_items</code> field is an array of <code>ActionItem</code> objects. Your IDE understands this nesting.</p><p>Here&#8217;s the function &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KyAa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KyAa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 424w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 848w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 1272w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KyAa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png" width="1456" height="953" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:953,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193633,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KyAa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 424w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 848w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 1272w, https://substackcdn.com/image/fetch/$s_!KyAa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2332aa4-38ba-4c03-a905-7853f1b0c1cb_1516x992.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Using it in Python.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g_4e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g_4e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 424w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 848w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 1272w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g_4e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png" width="1456" height="1905" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1905,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:635629,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g_4e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 424w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 848w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 1272w, https://substackcdn.com/image/fetch/$s_!g_4e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ec989cb-e2ad-4eee-9071-aa999976ce77_2048x2680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Your IDE knows <code>item</code> is an <code>ActionItem</code>. It knows <code>item.priority</code> is a <code>Priority</code> enum. You get autocomplete for nested fields.</p><h2>Configuration Management Done Right</h2><p>Remember that security nightmare of hardcoded API keys from Part 1? BAML handles this properly.</p><p>You can define multiple clients for different scenarios:</p><pre><code>client&lt;llm&gt; GPT4 {
  provider openai
  options {
    model gpt-4
    api_key env.OPENAI_API_KEY
    temperature 0
  }
}

client&lt;llm&gt; GPT5 {
  provider openai
  options {
    model gpt-5-mini
    api_key env.OPENAI_API_KEY
    temperature 0
  }
}

client&lt;llm&gt; Claude {
  provider anthropic
  options {
    model claude-3-sonnet-20240229
    api_key env.ANTHROPIC_API_KEY
    temperature 0
  }
}</code></pre><p>And in your function definition, you specify which client to use.</p><p>Want to test if Claude gives better results? Change one word. Want to swap out the model for a locally hosted one? Define a new client, point it at your local endpoint, change one word in your function. The rest of your Python code doesn&#8217;t change.</p><p>This is proper abstraction. You&#8217;ve separated what you want to do (analyze a resume) from how you&#8217;re doing it (which model, which provider).</p><p>I usually start with <code>gpt-4.1-nano</code> for development and testing because it&#8217;s cheaper. Once the prompts work well, I switch to <code>gpt-4.1</code> for production. With BAML, that&#8217;s a one-word change in the client configuration.</p><h2>The Principles Matter More than the Tool</h2><p>BAML could disappear tomorrow and these lessons would still be valuable.</p><p>What matters:</p><ul><li><p><strong>Type safety:</strong> Define schemas for your LLM outputs and enforce them.</p></li><li><p><strong>Separation of concerns:</strong> Keep prompts separate from application logic.</p></li><li><p><strong>Testability:</strong> Make it possible to test your prompts in isolation.</p></li><li><p><strong>Versioning:</strong> Treat prompts as code that can be versioned and reviewed.</p></li><li><p><strong>Configuration management:</strong> Separate what you&#8217;re doing from how you&#8217;re doing it.</p></li></ul><p>You could implement these principles with Pydantic models for type safety, a custom prompt management system, a simple YAML file for prompts, or your own abstraction layer.</p><p>BAML makes it easier. But if you walk away from this thinking &#8220;I need to use BAML,&#8221; you&#8217;ve missed the point. Walk away thinking &#8220;I need type safety and structure for my LLM applications,&#8221; then choose whatever tool helps you get there.</p><p>Before BAML, I tried building my own prompt management system with Pydantic. It worked okay, but it was a lot of boilerplate code. BAML handles all that boilerplate for you, which is why I switched. But the underlying principles (schemas, separation of concerns, testability) were the same.</p><h2>The VSCode Playground</h2><p>One more thing about BAML: the VSCode extension gives you a playground where you can test your prompts without running your entire application.</p><p>You can click on a function in your <code>.baml </code>file, click &#8220;Test in Playground,&#8221; and try it out with different inputs. Instead of modifying your prompt, running your Python script, waiting for the API call, checking the output, repeat&#8230;you test right there in VSCode.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!64IX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!64IX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 424w, https://substackcdn.com/image/fetch/$s_!64IX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 848w, https://substackcdn.com/image/fetch/$s_!64IX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 1272w, https://substackcdn.com/image/fetch/$s_!64IX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!64IX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82460,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178830755?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!64IX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 424w, https://substackcdn.com/image/fetch/$s_!64IX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 848w, https://substackcdn.com/image/fetch/$s_!64IX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 1272w, https://substackcdn.com/image/fetch/$s_!64IX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92786dee-db0d-4021-9193-1cbf56982aea_1458x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The first time I used this, I saved myself 30 minutes of iteration time. I was tweaking a prompt to get the output format right, and being able to test it directly in VSCode without context-switching to my terminal made a massive difference. </p><h2>Exercise: Try Converting One Prompt</h2><p>Pick one prompt from your codebase. Just one.</p><p>Install BAML (<code>pip install baml-py</code>), create a <code>baml_src</code> directory, define the output schema as a BAML class, convert the prompt to a BAML function, generate the Python code, and replace your old implementation with the new type-safe version.</p><p>When you define that output schema, you&#8217;ll probably realize your prompt was underspecified. You&#8217;ll find yourself adding clarifications like &#8220;if there&#8217;s a range, use the minimum&#8221; or &#8220;return dates in ISO format.&#8221; This is good. This is you thinking more carefully about what you actually want the LLM to do.</p><p>I&#8217;ve done this with probably a dozen prompts, and every time I realize my original prompt was vaguer than I thought. The act of defining a schema forces you to be explicit about your expectations.</p><blockquote><p>&#9888;&#65039; The biggest mistake I see: trying to define overly complex schemas on the first try. Start simple. Get one field working with type safety, then add more fields.</p></blockquote><p>In the next post, we tackle testing. Now that we have type-safe, well-structured prompts, we can test them properly. We&#8217;ll cover mocking, test doubles, and how to build a comprehensive test suite without calling the OpenAI API a thousand times (and racking up a ridiculous bill).</p><p>Happy coding &#128077;&#127998;</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Six Problems Hiding in Your LLM Notebook Code]]></title><description><![CDATA[Right now, there's probably a data scientist waking up to a $3,000 OpenAI bill because a bot found their exposed API key and has been hammering GPT-4 endpoints for 12 hours.]]></description><link>https://mlpivot.dev/p/six-problems-with-your-llm-code</link><guid isPermaLink="false">https://mlpivot.dev/p/six-problems-with-your-llm-code</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Sun, 16 Nov 2025 14:14:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!f5dz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f5dz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f5dz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 424w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 848w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f5dz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg" width="1024" height="576" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:62670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f5dz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 424w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 848w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!f5dz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ec287ea-6223-457d-82d1-991a23e7107d_1024x576.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I&#8217;ll never forget the first time I got an LLM to work locally. I was messing around with HuggingFace&#8217;s transformers library, running a sentiment analysis model on some text data. The model actually worked! It classified my test sentences correctly, the output looked reasonable, and I felt accomplished &#9786;&#65039;!</p><p>Not to mention the timing worked out well. My team had been talking about adding some LLM capability to one of our features for a few sprints. We kept deprioritizing it, moving it around in the backlog. Eventually, my tech lead asked if anyone wanted to spike it and see if it was feasible.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>I volunteered. I mean, I&#8217;d just spent a weekend playing with transformers anyway, so why not?</p><p>I spent a few days tinkering in a Jupyter notebook. Got a basic prototype working, and showed it to people on my team during a sync. Everyone thought it was pretty cool. The PM got excited and my manager seemed interested. Apparently this prototype was what we needed.</p><p>Then sprint planning rolled around the next week.</p><p>&#8220;Okay, so the spike is done,&#8221; the PM said, pulling up the ticket. &#8220;How many story points would it take to actually build this properly?&#8221;</p><p>I stared at my screen. My brain went completely blank.</p><p>&#8220;Uh&#8230;more than 5?&#8221; I said. &#8220;Maybe 13? Honestly, I have no idea.&#8221;</p><p>And I really didn&#8217;t. My prototype was a 150-line Jupyter notebook with hardcoded API keys and string manipulation. How much work sits between &#8220;it works in my notebook&#8221; and &#8220;this is production-ready&#8221;?</p><p>Spoiler: a lot.</p><h2>Why Notebooks Fail in Production</h2><p>This isn&#8217;t a criticism of notebooks. I love using Jupyter notebooks or Google Colab! They&#8217;re perfect for experimentation, quick iterations, and showing your work. But there&#8217;s a massive gap between &#8220;it works in my notebook&#8221; and &#8220;this is ready to serve real traffic.&#8221; If you come from a data science or research background, you might not even know what that gap looks like yet.</p><p>This series bridges that gap. We&#8217;re going to take a messy notebook with LLM code and transform it into a production-ready system. No hand-waving, no shortcuts. We&#8217;ll cover type safety, testing, self-hosting, deployment, and everything else that separates a <strong>prototype</strong> from a <strong>product</strong>.</p><p>But first, we need to honestly assess the problem. Let me show you what makes notebook code so problematic for production systems.</p><h2>The Mess We&#8217;re Starting With</h2><p>Here&#8217;s what a typical &#8220;I just need to get this working&#8221; LLM notebook looks like.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E_Kw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E_Kw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 424w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 848w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E_Kw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png" width="1456" height="1476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:308522,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E_Kw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 424w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 848w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!E_Kw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6fe0b553-0338-4bee-8016-5fdd53a71c73_1616x1638.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>At first glance, this seems fine. It&#8217;s about 30 lines and it works. You can run it, get results, and show your team. So what&#8217;s the problem?</p><p>I&#8217;ll show you.</p><h2>The Type Safety Issue</h2><p>Quick question: what does `analyze_resume()` return?</p><p>You might say &#8220;a dictionary.&#8221; But what keys are in that dictionary? What are their types? Is `years_of_experience` an int or a string? Is `skills` a list of strings or a comma-separated string?</p><p>You have to read the prompt, hope the LLM returns what you asked for, and pray it&#8217;s consistent. There&#8217;s no contract, no schema, no guarantee. Also, your IDE can&#8217;t help you with autocomplete because it has no idea what you&#8217;re working with.</p><p>I&#8217;ve run into this exact issue before. I had a function that was supposed to return a structured response from an LLM, and I just assumed `experience_years` would be an integer. Made sense, right? Turns out the LLM sometimes returned &#8220;5 years&#8221; as a string, sometimes &#8220;5&#8221; as a string, and occasionally (I kid you not) &#8220;five&#8221; spelled out. My downstream code that tried to do `if experience_years &gt; 3` crashed in three different ways depending on which format showed up.</p><p>If I can&#8217;t tell what this function returns without running it, neither can anyone else on my team. When the LLM randomly decided to format the output differently, we&#8217;ll have runtime errors that should&#8217;ve been caught way earlier.</p><h2>The Security Nightmare</h2><p>See the API key on line 5? That&#8217;s a production incident waiting to happen.</p><p>Early in my career, I was reviewing a PR from someone on my team. The code looked fine, tests passed, everything seemed great. But I noticed a hardcoded HuggingFace API token.</p><p>&#8220;Did you push this to the repo?&#8221; I asked.</p><p>&#8220;Yeah, but I&#8217;ll change it right after!&#8221;</p><p>My stomach dropped. &#8220;How many commits ago?&#8221;</p><p>&#8220;Like, two commits back.&#8221;</p><p>Here&#8217;s the thing. Git is like an elephant. It NEVER forgets. Even if you change a secret in a later commit, it&#8217;s still sitting there in your repo&#8217;s history. Anyone who clones the repository can see it with a simple `git log` command. If your repo is public (or gets accidentally made public), that secret is now out in the wild.</p><p>GitHub has secret scanning, but it&#8217;s not instantaneous. By the time you get the notification email, bots have already scraped your token and added it to a database somewhere.</p><p>Right now, there&#8217;s probably a data scientist somewhere waking up to a $3,000 OpenAI bill because a bot found their exposed API key in a public repo and has been hammering GPT-4 endpoints for the past 12 hours. The hacker running that bot doesn&#8217;t care about your project or your career. They just want free compute.</p><h3>What to Do If You&#8217;ve Leaked a Secret</h3><p>If you realized you&#8217;ve committed a secret to Git:</p><ol><li><p>Revoke the key <strong>immediately.</strong></p></li><li><p>Report the incident.</p></li><li><p>Check your billing ASAP (OpenAI keys, AWS keys, anything that costs money).</p></li><li><p>Generate a new one.</p></li><li><p>Use `git filter-branch` or BFG Repo-Cleaner to remove it from history.</p></li><li><p>Force push (coordinate with your team first or get someone from security to help you).</p></li></ol><h3>The Right Way to Handle Secrets</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_CuX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_CuX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 424w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 848w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 1272w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_CuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png" width="1430" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd50049f-215e-4774-8918-8ebb10066304_1430x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164277,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_CuX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 424w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 848w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 1272w, https://substackcdn.com/image/fetch/$s_!_CuX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd50049f-215e-4774-8918-8ebb10066304_1430x818.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#9888;&#65039; And add `.env` to your `.gitignore` file. Always. Every single time. No exceptions. I have this muscle memory now where if I create a new repo, `.gitignore` is literally the first file I create, forget the README.</p></blockquote><h2>The Testing Dilemma</h2><p>How do you test `analyze_resume()`? Right now, you can&#8217;t. Well, you can, but you&#8217;d have to actually call the OpenAI API every single time you run your tests. This creates several problems.</p><ul><li><p>It costs money (every test run hits your API bill).</p></li><li><p>It&#8217;s slow (GPT-4 responses can take several seconds each).</p></li><li><p>It&#8217;s non-deterministic (same inputs potentially lead to different outputs).</p></li><li><p>It&#8217;s fragile (tests fail when OpenAI has an outage).</p></li></ul><p>This function is tightly coupled to the external API. There&#8217;s no way to test your logic without actually calling OpenAI. Your tests are expensive, slow, and flaky. Your CI/CD pipeline would take forever to run, cost a fortune, and randomly fail for reasons that have nothing to do with your code.</p><h2>The String Manipulation Problem</h2><p>Let&#8217;s take a look at that prompt construction again &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zShF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zShF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 424w, https://substackcdn.com/image/fetch/$s_!zShF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 848w, https://substackcdn.com/image/fetch/$s_!zShF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 1272w, https://substackcdn.com/image/fetch/$s_!zShF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zShF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png" width="1456" height="525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zShF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 424w, https://substackcdn.com/image/fetch/$s_!zShF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 848w, https://substackcdn.com/image/fetch/$s_!zShF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 1272w, https://substackcdn.com/image/fetch/$s_!zShF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42a0c557-d983-4393-b955-d99cf1d6347e_1548x558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is fine for a quick prototype. But what happens when you need to reuse this prompt logic in three different places? What if you want to A/B test different prompt versions? What if you want to keep track of which prompt version produced which results?</p><p>Right now, your prompt is just a string buried in a function. You can&#8217;t version it, can&#8217;t test different variations systematically, can&#8217;t share it across files without copy-pasting.</p><p>I&#8217;ve been in codebases where we had seven different versions of basically the same prompt because everyone just copied and tweaked the string instead of having a centralized place for prompt templates. When we needed to change something fundamental, we had to hunt through the entire repo. Not fun.</p><h2>The Error Handling Setback</h2><p>What happens when OpenAI&#8217;s API is down? What if you hit rate limits? What if the response isn&#8217;t valid JSON? What if the model returns null for a required field?</p><p>Right now, the answer is: your code crashes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3ekT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3ekT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 424w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 848w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 1272w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3ekT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png" width="1210" height="558" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:558,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113099,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3ekT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 424w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 848w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 1272w, https://substackcdn.com/image/fetch/$s_!3ekT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce5625e-4ca2-4f96-af0b-6a35f2beaa67_1210x558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There&#8217;s no retry logic, no fallback behavior, no graceful degradation. Just exceptions waiting to blow up your application.</p><h2>The Observability Obstacle</h2><p>When something goes wrong in production (and it will), how do you debug it?</p><p>The current notebook has a `print` statement. That&#8217;s it. You have no logs, no metrics, no tracing. You can&#8217;t answer basic questions like:</p><ul><li><p>How long do requests actually take?</p></li><li><p>Which prompts are performing poorly?</p></li><li><p>What&#8217;s our error rate?</p></li><li><p>Which inputs consistently cause failures?</p></li><li><p>How much are we spending on API calls?</p></li></ul><p>You&#8217;re flying blind. I&#8217;ve been pulled into emergency debugging sessions where a service was failing intermittently. Having zero visibility into what&#8217;s happening is genuinely awful. You end up adding print statements, redeploying, waiting for it to fail again, checking logs, and repeating that cycle until you find the issue. It&#8217;s tedious and totally avoidable.</p><h2>What <em>Production-Ready</em> Actually Means</h2><p>I&#8217;ve thrown around the term &#8220;production-ready&#8221; a lot, but what does it actually mean?</p><p>Here&#8217;s my definition: production-ready code is code that you wouldn&#8217;t panic about if you got a Slack message saying it broke while you were in a meeting.</p><p>That means:</p><ul><li><p>Reliable (it handles errors gracefully and recovers from failures).</p></li><li><p>Observable (you can see what it&#8217;s doing and diagnose problems quickly).</p></li><li><p>Maintainable (other people can understand and modify it, including future you).</p></li><li><p>Testable (you can verify it works without manual testing).</p></li><li><p>Secure (it doesn&#8217;t leak secrets or expose vulnerabilities).</p></li><li><p>Performant (it can handle your expected load, plus some buffer).</p></li></ul><p>Notice I didn&#8217;t say <em><strong>perfect</strong></em>. Production-ready doesn&#8217;t mean bug-free. It means you&#8217;ve thought through the failure modes and built in safeguards.</p><h2>Where We&#8217;re Going</h2><p>Over the next four posts, we&#8217;re going to systematically fix every problem I just outlined. Here&#8217;s the roadmap &#129517;.</p><h4>Post 2: Treating Prompts Like Code with BAML</h4><p>We&#8217;ll introduce BAML (a domain-specific language for LLM applications) and use it to add type safety and structure to our prompts. You&#8217;ll learn why treating prompts like untyped strings is asking for trouble.</p><h4>Post 3: Testing Your LLM Applications (Without Going Broke)</h4><p>We&#8217;ll build a testing strategy that doesn&#8217;t require calling OpenAI&#8217;s API for every single test. You&#8217;ll learn about mocking, test doubles, and how to test LLM applications in a way that&#8217;s actually maintainable.</p><h4>Post 4: Self-Hosting LLMs with vLLM</h4><p>We&#8217;ll set up vLLM (a high-performance inference engine) to self-host open-source models. You&#8217;ll learn when self-hosting makes sense, when it doesn&#8217;t, and how to actually do it with Docker and GPU support.</p><h4>Post 5: The Complete Production Pipeline</h4><p>We&#8217;ll bring everything together: Dockerize the application, set up CI/CD, add proper monitoring and logging, implement error handling and retries, and deploy it.</p><p>By the end of this series, you&#8217;ll have transformed a messy notebook into a production system that you&#8217;d actually want to maintain. More importantly, you&#8217;ll understand the principles behind each decision.</p><h2>Homework - Audit Your Own Code</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eff_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eff_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eff_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eff_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eff_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eff_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16949825,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.dev/i/178704630?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eff_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 424w, https://substackcdn.com/image/fetch/$s_!eff_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 848w, https://substackcdn.com/image/fetch/$s_!eff_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!eff_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3e9180dd-77e5-4300-acbd-504a11aa06ad_6720x4480.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Before moving on to the next post, I want you to do an honest audit of your own LLM code (if you have any). If you don&#8217;t have LLM code yet, grab any data science notebook you&#8217;ve written recently.</p><p>Go through and honestly assess these problems.</p><ol><li><p>Type safety: Can you tell what your functions return without running them?</p></li><li><p>Configuration: Are any secrets hardcoded? Did they ever make it into version control?</p></li><li><p>Testability: Can you test your logic without hitting external APIs?</p></li><li><p>String manipulation: Are your prompts scattered throughout your code as raw strings?</p></li><li><p>Error handling: What actually happens when things go wrong?</p></li><li><p>Observability: If this broke while you were away, could you figure out why within 15 minutes?</p></li></ol><p>Be brutally honest with yourself. I&#8217;m not here to judge (I&#8217;ve written plenty of questionable notebook code over the years). But you can&#8217;t fix problems you don&#8217;t acknowledge.</p><p>Write down what you find. Take notes on which problems show up most often in your code. We&#8217;re going to address each one over this series.</p><p>In the next post, we&#8217;ll tackle type safety and string manipulation by introducing BAML and treating our prompts like the first-class citizens they deserve to be. We&#8217;ll add structure and sanity to our LLM applications. No more crossing your fingers and hoping the output is what you expect.</p><p>Happy coding &#128640;!</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Random Sampling Is Sabotaging Your Models! ]]></title><description><![CDATA[Did you know diversity can detect hallucinations? Ask an LLM "what's the capital of France" five times and you'll get some variation of "Paris" over and over again. But ask that same LLM about the "boiling point of a dragon's scale" and you'll get five different made-up answers.]]></description><link>https://mlpivot.dev/p/random-sampling-is-sabotaging-you</link><guid isPermaLink="false">https://mlpivot.dev/p/random-sampling-is-sabotaging-you</guid><dc:creator><![CDATA[The Machine Learning Pivot]]></dc:creator><pubDate>Sun, 09 Nov 2025 16:36:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e72ce0f9-554e-43e7-b552-a3c6aeae5b1e_3290x1962.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Several months ago, my teammate spent some time debugging why our chatbot kept failing in production. The model worked perfectly in dev. We had a solid test set with 500 examples...or so we thought.</p><p>When we finally figured it out, we discovered our test set had 50 variations for the same edge case and completely missed the actual failure modes. Random sampling failed us &#128557;.</p><p>I used to randomly sample everything. Test sets? Random sample. Few-shot examples? Random sample. Active learning? You guessed it, random sample. It seemed like the fair, unbiased approach. Then I stumbled onto research about maximally diverse subsets, and my entire approach to data selection changed.</p><p>Turns out, there&#8217;s a whole body of research on how to do this properly, and it&#8217;s been hiding in plain sight. Let me show you what I learned (and the mistakes I made along the way).</p><h2>What Even Are Maximally Diverse Subsets?</h2><p>The concept is simpler than the name suggests. Imagine you have 1,000 data points and need to pick 50 of them. Random sampling might cluster all 50 data points in one corner of your feature space. Diverse sampling spreads these data points out so they actually represent your full dataset.</p><p>What&#8217;s interesting is that there are different ways to measure diversity, and each measurement has different meanings.</p><p><strong>Maximum pairwise distance </strong>picks points as far apart as possible from each other. If your data clusters naturally (like images of cats vs dogs vs birds), this approach grabs representatives from each cluster.</p><p><strong>Coverage-based selection</strong> makes sure you&#8217;re hitting different regions of your feature space. Think of it like ensuring every neighborhood in a city has at least one representative on a committee.</p><p><strong>Determinantal Point Processes (DPPs)</strong> use a probabilistic approach borrowed from quantum physics. Yes, quantum physics &#129327;. DPPs naturally repel similar items, and the math behind them is honestly beautiful once you get past the intimidating name.</p><h2>Why Should You Care? </h2><p>I see four places where diversity-aware sampling makes a huge difference in your day-to-day work.</p><p><strong>Building test sets:</strong> You want test cases covering different scenarios, not 100 variations of the same thing.</p><p><strong>Few-shot prompting:</strong> Show your LLM different patterns. Three diverse examples teach more than five similar ones (I tested this and loved the results).</p><p><strong>Hallucination detection</strong>: When an LLM hallucinates, it generates diverse outputs. Measure that diversity to catch hallucinations.</p><p><strong>Data pruning with DPPs:</strong> Shrink your dataset by keeping the interesting stuff and dropping redundant examples. Your training time will decrease which means your AWS bill won&#8217;t make you nervous by the end of the month &#128184;.</p><h2>Creating Better Test Sets</h2><p>I was building a test set for a computer vision model a couple years ago. We had 10,000 labeled images but budget to manually review only 500 for our test set. I did what seemed reasonable: random sampling.</p><p>The test set looked fine at first glance. But when we deployed to production, the model failed on certain types of images that never appeared in our test set. Random sampling had oversampled common scenarios and completely missed rare but important edge cases.</p><p>When I dug into it, about 30% of my test images showed basically the same scene because the training data skewed heavily in that direction. Diverse sampling would have caught this immediately.</p><p>Two methods fix this problem.</p><p><strong>K-means clustering</strong> groups similar images together and picks representatives from each cluster. It&#8217;s fast and works great for medium-sized datasets. I use this one most often now.</p><p><strong>MaxMin Selection</strong> greedily picks points far from already-selected points. It&#8217;s slower on large datasets but guarantees good coverage across your entire feature space. (For large datasets, combine it with clustering first - see Pitfall #2 for the efficient approach.) </p><p>Here&#8217;s what I should have done.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3EKY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3EKY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 424w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 848w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 1272w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3EKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png" width="1446" height="1788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1788,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:437500,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3EKY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 424w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 848w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 1272w, https://substackcdn.com/image/fetch/$s_!3EKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67159099-65e1-4ebc-9a35-e536e90a64f6_1446x1788.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>MaxMin ensures your test set spreads throughout your feature space. No clumping allowed. </p><h2>My Few-Shot Selection Disaster</h2><p>I was building a sentiment classifier for movie reviews. I had 1,000 labeled examples and could fit 5 in my prompt. I did the &#8220;smart&#8221; thing and used k-nearest neighbors to pick the 5 most similar examples to my test query.</p><p>My results were disappointing. The model kept making the same mistakes over and over.</p><p>Then I realized that all 5 examples were teaching the LLM basically the same pattern. They were all positive reviews using similar language. The model learned t<em>his one way to express positivity</em> and nothing else.</p><p>I switched to Maximum Marginal Relevance (MMR), which balances similarity with diversity. The first example matches your query closely. Each subsequent example adds new information instead of repeating what the model already saw.</p><p>The improvement was immediate and obvious.</p><p>Here&#8217;s the code I wish I&#8217;d written the first time &#128071;&#127998;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q0kY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q0kY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q0kY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png" width="1456" height="1698" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1698,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:559654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q0kY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 424w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 848w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!Q0kY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd79c73d-f564-4b2d-91dd-3577e78fdd67_1852x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The paper, &#8220;Exploring the Role of Diversity in Example Selection for In-Context Learning&#8221;, confirmed what I saw in practice. MMR consistently beats pure similarity-based selection across different context sizes and similarity functions.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Clever Hallucination Detection Trick</h2><p>This one blew my mind when I first read about it. When an LLM hallucinates, it generates more diverse outputs across multiple samples. Why? Because it doesn&#8217;t actually know the answer, so it makes up different nonsense each time.</p><p>Think about it this way. Ask an LLM &#8220;What is the capital of France&#8221; five times. You&#8217;ll get five variations of &#8220;Paris&#8221; (maybe with different phrasing, but same core answer). <br><br>Now ask &#8220;What is the melting point of dragon scales?&#8221; five times. You&#8217;ll get five completely different made-up numbers because there&#8217;s no ground truth to anchor to. The diversity reveals the hallucination.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bwb7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bwb7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 424w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 848w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bwb7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png" width="1430" height="1414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360225,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bwb7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 424w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 848w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!Bwb7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92adf815-35e3-4a39-8793-b4cb9292d65b_1430x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Several recent hallucination detection papers explore this approach. The intuition is solid. Confident and correct models give consistent answers. Hallucinating models generate different nonsense each time because there&#8217;s no ground truth to anchor to.</p><h2>Data Pruning With Determinantal Point Processes</h2><p>Imagine you have 100k training samples, but training is expensive and slow. Can you shrink your dataset to 10k samples while maintaining model performance? Random sampling gives you mediocre results. But what if you could select the 10k most diverse, informative examples?</p><p>That&#8217;s where DPPs shine.</p><p>DPPs model diversity through a kernel matrix (basically a similarity matrix). The probability of selecting a subset is proportional to the determinant of the kernel matrix for that subset. Larger determinants equal more diverse subsets.</p><p>The paper, &#8220;Post-training Large Language Models for Diverse High-Quality Responses&#8221;, used DPPs to train models that generate both high-quality AND diverse outputs. The results beat standard training approaches across multiple benchmarks.</p><p>Here&#8217;s a simplified implementation (this is the greedy approximation, exact sampling gets more complex).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!glyU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!glyU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 424w, https://substackcdn.com/image/fetch/$s_!glyU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 848w, https://substackcdn.com/image/fetch/$s_!glyU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!glyU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!glyU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png" width="1456" height="1492" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:328098,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!glyU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 424w, https://substackcdn.com/image/fetch/$s_!glyU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 848w, https://substackcdn.com/image/fetch/$s_!glyU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 1272w, https://substackcdn.com/image/fetch/$s_!glyU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c1cbb5f-8b02-4e6e-a7b8-586a3dac9c3f_1598x1638.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The beauty of DPPs is that they give you a principled probabilistic framework for diversity. You&#8217;re not just greedily picking points. You&#8217;re sampling from a distribution that naturally prefers diverse subsets.</p><p>For data pruning, this means you can train on 10-20% of your data and match (or sometimes beat!) the performance of training on the full dataset. Less storage, faster training iterations, lower compute costs. Pretty cool, right? </p><p>You may be eager to start applying maximally diverse subsets in your work, but there are a couple pitfalls you need to watch out for.</p><h2>&#9888;&#65039; Pitfall #1: Diversity Without Quality</h2><p>I made this mistake because I was focused on getting diverse test samples, but I ended up with really low-quality examples in my test set. Some samples were near-duplicates with slight corruption. Others had labeling errors. One even had a cat labeled as &#8220;car&#8221; (still not sure how that happened).</p><p>Were they diverse? Sure. Were they useful? Nope.</p><p>The fix is straightforward. Filter for quality first, THEN apply diversity selection.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KSLH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KSLH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 424w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 848w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KSLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png" width="1456" height="920" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:293040,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KSLH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 424w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 848w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 1272w, https://substackcdn.com/image/fetch/$s_!KSLH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bd3ffc5-25bd-4164-9728-b858cab437f8_1886x1192.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Just like any analytics, data science, or machine learning problem, you need clean data. Garbage in, garbage out, am I right?</p><h2>&#9888;&#65039; Pitfall #2: The Computational Cost</h2><p>Computing pairwise distances for large datasets is expensive.</p><p>My first implementation processed 10,000 data points to select 100 examples. It took about half an hour.</p><p>The problem? I was computing ~50 million pairwise comparisons. That&#8217;s O(n&#178;) complexity. When you&#8217;re iterating and experimenting, that adds up fast.</p><p>The fix? Use clustering to reduce the problem size.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DN4y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DN4y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 424w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 848w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DN4y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png" width="1456" height="1380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1380,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:471418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DN4y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 424w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 848w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!DN4y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583a7759-575e-4d1a-b3e7-3944c0d376ca_1768x1676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is how the two-stage approach works.</p><p><strong>Step 1: Cluster your data</strong> into more clusters than you need (maybe 2x your target). This step is O(n &#215; k) where k is the number of clusters - much faster than O(n&#178;).</p><p><strong>Step 2: Pick one representative from each cluster</strong> - the point closest to each cluster center. Now you have maybe 200 representative points instead of 10,000.</p><p><strong>Step 3: Run maxmin on the representatives</strong> to select your final diverse subset. Computing pairwise distances on 200 points is trivial compared to 10,000.</p><p>Why does this work for diversity? The clusters identify distinct regions of your feature space. One sample per cluster gives you coverage across different regions. Then maxmin on the representatives ensures you&#8217;re selecting the most spread-out samples from those regions.</p><p>This two-stage approach drops the runtime to under a few minutes for my use case. For 100k samples selecting 1k, this runs orders of magnitude faster than computing all pairwise distances.</p><p>Your teammates will appreciate you not hogging up all the resources for your experimental work &#128521;.</p><h2>Note: Pick the Right Diversity Metric</h2><p>Not all distance metrics are created equal, and choosing the wrong one can give you weird results.</p><p><strong>Euclidean distance</strong> works for continuous embeddings where magnitude matters. I use this for image embeddings from ResNet or similar models.</p><p><strong>Cosine distance</strong> works better for normalized embeddings where direction matters more than magnitude. I default to this for text embeddings from sentence transformers.</p><p><strong>Semantic entropy </strong>captures meaning-level diversity. This requires clustering semantically equivalent outputs first. It&#8217;s more complex but sometimes worth it.</p><p>Your use case may result in using different distance metrics, and that&#8217;s fine as long as you&#8217;re aware that different metrics can render you different results.</p><h2>The Numbers Don&#8217;t Lie</h2><p>The research on diversity-aware sampling keeps getting better, and the results are honestly kind of exciting.</p><p>The February 2025 &#8220;Diversity as a Reward&#8221; paper showed a 27% improvement on math reasoning tasks. They used less than 10% of the original training data and got BETTER results. That&#8217;s awesome! Less data, less compute, better performance.</p><p>The few-shot learning improvements are equally impressive. The 2025 <strong>In-Context Learning</strong> paper showed consistent gains across different context sizes and similarity functions. No additional training required. Just change how you select examples.</p><p>The alignment work from &#8220;Efficient Alignment of Large Language Models via Data Sampling&#8221; demonstrated that you can match full-dataset performance using less than 10% of the data. Fewer labels to collect, less compute for training, faster iteration cycles. That&#8217;s a win every time.</p><p>These results come from different teams working on different models tackling different tasks. The pattern holds across domains. Diversity matters.</p><p>However, it&#8217;s not always the answer.</p><h2>When Not To Use Diversity</h2><h4>Case 1: Debugging Specific Failures</h4><p>Your model consistently chokes on questions about historical dates. You don&#8217;t want diverse examples. You want MORE historical date examples to understand the failure pattern.</p><h4>Case 2: Narrow Specialization</h4><p>You&#8217;re fine-tuning a model to generate SQL queries for your company&#8217;s specific database schema. You want homogeneous training data that teaches one thing really well. Diversity would dilute the signal.</p><h4>Case 3: Consistency Checking</h4><p>You&#8217;re testing whether slight input variations produce wildly different outputs. You need similar inputs on purpose.</p><p>Diversity is a tool for coverage and robustness. But sometimes you need focus instead.</p><h2>Your Action Plan for Next Week</h2><p>If you want to get started, start small by creating a function to help you select diverse test sets.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RhVe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RhVe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 424w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 848w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 1272w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RhVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png" width="1456" height="476" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117521,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://mlpivot.substack.com/i/178276585?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RhVe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 424w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 848w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 1272w, https://substackcdn.com/image/fetch/$s_!RhVe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5afa1a98-ebde-4a6b-bb53-28e089203606_1480x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>With this, you&#8217;ll build better test sets, select better few-shot examples, and catch failures earlier. Hopefully, you can spend less time debugging and more time creating.</p><h2>Continued Reading</h2><p>If this post clicked with you and you want to dive deeper, check out these papers. They&#8217;re all recent and fairly straightforward.</p><p><strong>&#8220;Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data&#8221;</strong> (2025)<br><a href="https://arxiv.org/abs/2502.04380">https://arxiv.org/abs/2502.04380</a></p><p><strong>&#8220;Post-training Large Language Models for Diverse High-Quality Responses&#8221;</strong> (2025)<br><a href="https://arxiv.org/abs/2509.04784">https://arxiv.org/abs/2509.04784</a></p><p><strong>&#8220;Exploring the Role of Diversity in Example Selection for In-Context Learning&#8221;</strong> (2025)<br><a href="https://www.arxiv.org/abs/2505.01842">https://www.arxiv.org/abs/2505.01842</a></p><p><strong>&#8220;Efficient Alignment of Large Language Models via Data Sampling&#8221;</strong> (2024)<br><a href="https://arxiv.org/abs/2411.10545">https://arxiv.org/abs/2411.10545</a></p><h2>Code Available Here &#128206;</h2><p>All code snippets from this post are available in this <a href="https://github.com/MLpivot/diversity-aware-sampling">GitHub repo</a>.</p><p>Feel free to experiment with the MMR implementation and diversity parameter tuning!</p><p>Happy coding &#10024;</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://mlpivot.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>