<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Posts on Rosen Papazov</title>
        <link>https://shadowcp.dev/posts/</link>
        <description>Recent content in Posts on Rosen Papazov</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <copyright>&lt;a href=&#34;https://creativecommons.org/licenses/by-nc/4.0/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CC BY-NC 4.0&lt;/a&gt;</copyright>
        <lastBuildDate>Mon, 06 Apr 2026 12:00:00 +0300</lastBuildDate>
        <atom:link href="https://shadowcp.dev/posts/index.xml" rel="self" type="application/rss+xml" />
        
        <item>
            <title>Running Local AI on Intel Lunar Lake — Part 1: Hardware, Drivers, and the Intel Compute Stack</title>
            <link>https://shadowcp.dev/posts/2026/04/running-local-ai-on-intel-lunar-lake-part-1-hardware-drivers-and-the-intel-compute-stack/</link>
            <pubDate>Mon, 06 Apr 2026 12:00:00 +0300</pubDate>
            
            <guid>https://shadowcp.dev/posts/2026/04/running-local-ai-on-intel-lunar-lake-part-1-hardware-drivers-and-the-intel-compute-stack/</guid>
            <description>Part 1 of 4 in a series on building a fully local AI development environment on an Intel Core Ultra 7 268V laptop running Fedora 43.
Why local AI? Cloud-hosted AI is convenient until it isn&amp;rsquo;t. Latency spikes, privacy concerns with proprietary code, rate limits during crunch time, and the monthly bill that creeps up. Running models locally solves all of these — if your hardware can handle it.
Intel&amp;rsquo;s Lunar Lake processors are interesting for this.</description>
            <content type="html"><![CDATA[<p><em>Part 1 of 4 in a series on building a fully local AI development environment on an Intel Core Ultra 7 268V laptop running Fedora 43.</em></p>
<hr>
<h2 id="why-local-ai">Why local AI?</h2>
<p>Cloud-hosted AI is convenient until it isn&rsquo;t. Latency spikes, privacy concerns with proprietary code, rate limits during crunch time, and the monthly bill that creeps up. Running models locally solves all of these — if your hardware can handle it.</p>
<p>Intel&rsquo;s Lunar Lake processors are interesting for this. They ship with three compute engines: a CPU, an integrated Arc GPU, and a dedicated NPU (Neural Processing Unit). The question is whether the Linux software stack is mature enough to actually use them for LLM inference.</p>
<p>This series documents my experience setting it up on Fedora 43. Spoiler: it works, but the path has sharp edges.</p>
<h2 id="the-hardware">The hardware</h2>
<p>The Intel Core Ultra 7 268V (Lunar Lake) packs:</p>
<ul>
<li><strong>CPU:</strong> 8 cores (4P + 4E), up to 5.0 GHz</li>
<li><strong>GPU:</strong> Intel Arc 140V — 8 Xe2 cores at up to 2.0 GHz</li>
<li><strong>NPU:</strong> Intel AI Boost — 48 TOPS at INT8</li>
<li><strong>RAM:</strong> 32 GB LPDDR5X-8533, on-package (~30 GB visible to the OS)</li>
</ul>
<p>The critical detail is <strong>shared memory</strong>. There is no dedicated VRAM. The GPU, NPU, and CPU all draw from the same 30 GB pool. Every gigabyte allocated to a model is a gigabyte taken from the system. This shapes every decision that follows.</p>
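<p>A practical consequence: there are no VRAM counters to consult. Model weights land in the same pool that ordinary tools report, so the quickest way to see what a model really costs is to compare plain <code>free</code> output before and after starting an inference server:</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># No separate VRAM to monitor: GPU and NPU allocations show up in the
# same 'used' column free(1) reports. Run before and after a model load.
free -h
</code></pre>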
<h2 id="what-were-building">What we&rsquo;re building</h2>
<p>Two inference stacks, each serving a different purpose:</p>
<ol>
<li><strong>Code autocompletion</strong> — a small, fast model (Qwen2.5-Coder-1.5B, INT4 quantized) served via OpenVINO on the integrated GPU, consumed by VS Code through the Continue.dev extension.</li>
<li><strong>IT reasoning and debugging</strong> — a larger chain-of-thought model (DeepSeek-R1-Distill-Qwen-7B, Q4_K_M) served via llama.cpp with the SYCL backend, accessible through Open WebUI in a browser.</li>
</ol>
<p>The architecture:</p>
<pre tabindex="0"><code>VS Code (Continue.dev)
  └─ Tab autocomplete ──▶ completion-server :8081  (OpenVINO GenAI)

Browser (Open WebUI :3000)
  └─ Chat / reasoning ──▶ llama-gpu :8080  (llama.cpp SYCL)
</code></pre><p>Both run as systemd user services. The completion server auto-starts on login; the reasoning stack starts on demand.</p>
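<p>Once both services are up, a tiny probe confirms they answer. The <code>/health</code> route is what llama.cpp&rsquo;s server exposes; whether the completion server answers the same route is an assumption here (its API is covered in Part 2):</p>
<pre tabindex="0"><code class="language-python" data-lang="python"># Minimal availability probe for the two local endpoints above.
import urllib.request

def probe(url, timeout=2):
    # Return the HTTP status code, or a short note if the service is down.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except OSError as exc:
        return 'down: %s' % exc

for name, url in [('llama-gpu', 'http://localhost:8080/health'),
                  ('completion-server', 'http://localhost:8081/health')]:
    print(name, probe(url))
</code></pre>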
<h2 id="installing-the-intel-compute-stack">Installing the Intel compute stack</h2>
<p>Fedora 43 ships kernel 6.19, which has the <code>intel_vpu</code> (NPU) and <code>xe</code> (GPU) drivers built in. The kernel side works out of the box — <code>/dev/dri/renderD128</code> for the GPU and <code>/dev/accel/accel0</code> for the NPU are both present at boot. What&rsquo;s missing is the userspace.</p>
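<p>A quick sanity check of the kernel side before touching any userspace packages:</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Confirm the kernel exposed both accelerators. Both nodes should be
# present at boot with no extra configuration.
for dev in /dev/dri/renderD128 /dev/accel/accel0; do
  if [ -e "$dev" ]; then
    echo "present: $dev"
  else
    echo "missing: $dev"
  fi
done
</code></pre>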
<h3 id="gpu-compute-runtime">GPU compute runtime</h3>
<p>The GPU needs OpenCL and Level Zero libraries for compute workloads. Fedora has these packaged:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>sudo dnf install intel-compute-runtime intel-opencl intel-level-zero oneapi-level-zero
</span></span></code></pre></div><p>A subtle point: <code>intel-level-zero</code> is the GPU-specific Level Zero <em>backend</em>. <code>oneapi-level-zero</code> is the Level Zero <em>loader</em> (<code>libze_loader.so.1</code>). You need both. Without the loader, nothing can discover the GPU backend — OpenVINO will see <code>CPU</code> only.</p>
<p>I spent time debugging this because the package names suggest they&rsquo;re the same thing. They aren&rsquo;t.</p>
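<p>The linker cache is the fastest way to confirm both halves are in place. (The <code>libze_intel_gpu</code> name is what the GPU backend ships as on my system; verify against your package&rsquo;s file list if the grep comes up empty.)</p>
<pre tabindex="0"><code class="language-bash" data-lang="bash"># Both the Level Zero loader and the GPU backend must resolve:
ldconfig -p | grep -E 'libze_loader|libze_intel_gpu' || echo 'Level Zero libraries not found'
</code></pre>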
<h3 id="npu-userspace-driver--building-from-source">NPU userspace driver — building from source</h3>
<p>This is where Fedora&rsquo;s packaging falls short. The <code>intel-npu-driver</code> package exists in Fedora Rawhide but isn&rsquo;t available in Fedora 43&rsquo;s repos. Intel&rsquo;s official releases on GitHub only provide <code>.deb</code> packages for Ubuntu. The community COPR that used to provide RPMs was archived in October 2025.</p>
<p>That leaves building from source:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>git clone --depth <span style="color:#ae81ff">1</span> https://github.com/intel/linux-npu-driver.git /tmp/linux-npu-driver
</span></span><span style="display:flex;"><span>cd /tmp/linux-npu-driver
</span></span><span style="display:flex;"><span>git submodule update --init --recursive
</span></span><span style="display:flex;"><span>cmake -B build -S . <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -DCMAKE_CXX_FLAGS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-fcf-protection=none&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -DCMAKE_C_FLAGS<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;-fcf-protection=none&#34;</span>
</span></span><span style="display:flex;"><span>cmake --build build --parallel <span style="color:#66d9ef">$(</span>nproc<span style="color:#66d9ef">)</span>
</span></span><span style="display:flex;"><span>sudo cmake --install build
</span></span></code></pre></div><p>The <code>-fcf-protection=none</code> flag is needed because the driver&rsquo;s build system doesn&rsquo;t account for Fedora&rsquo;s default control-flow enforcement settings.</p>
<p>The resulting <code>libze_intel_npu.so</code> installs to <code>/usr/local/lib64/</code>, which isn&rsquo;t in Fedora&rsquo;s default library search path:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>echo /usr/local/lib64 | sudo tee /etc/ld.so.conf.d/local-lib64.conf
</span></span><span style="display:flex;"><span>sudo ldconfig
</span></span></code></pre></div><h3 id="verification">Verification</h3>
<p>After installation and a re-login (to pick up group membership changes: on Fedora the GPU and NPU device nodes are typically owned by the <code>render</code> group, so your user needs to be in it):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>$ python3 -c <span style="color:#e6db74">&#34;from openvino import Core; print(Core().available_devices)&#34;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">[</span><span style="color:#e6db74">&#39;CPU&#39;</span>, <span style="color:#e6db74">&#39;GPU&#39;</span>, <span style="color:#e6db74">&#39;NPU&#39;</span><span style="color:#f92672">]</span>
</span></span></code></pre></div><p>All three devices visible. The foundation is in place.</p>
<h2 id="the-memory-budget">The memory budget</h2>
<p>Before choosing models, I mapped out the memory constraints. With 30 GB total and ~13 GB typically consumed by the OS, desktop, and VS Code, I have roughly 17 GB for AI workloads. But I want headroom for KV cache, context windows, and not swapping under load. My practical budget is ~8 GB for models.</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>RAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>7B model (Q4_K_M) + KV cache</td>
<td>~5.5 GB</td>
</tr>
<tr>
<td>1.5B model (INT4) + overhead</td>
<td>~1.2 GB</td>
</tr>
<tr>
<td>Open WebUI container</td>
<td>~0.5 GB</td>
</tr>
<tr>
<td><strong>AI total</strong></td>
<td><strong>~7.2 GB</strong></td>
</tr>
<tr>
<td>OS + desktop + VS Code</td>
<td>~13 GB</td>
</tr>
<tr>
<td><strong>System total</strong></td>
<td><strong>~20 GB</strong></td>
</tr>
<tr>
<td><strong>Remaining</strong></td>
<td><strong>~10 GB</strong></td>
</tr>
</tbody>
</table>
<p>This rules out 14B+ models for the reasoning stack (Q4_K_M would be ~8.5 GB, leaving almost nothing for context) and confirms that two models can coexist comfortably.</p>
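<p>The figures above follow from a simple rule of thumb: file size is roughly parameter count times effective bits per weight, padded for embeddings and quantization scales. A sketch of the arithmetic (the ~4.8 bits/weight figure for Q4_K_M and the 10% overhead factor are my working assumptions, not published constants):</p>
<pre tabindex="0"><code class="language-python" data-lang="python">def quantized_size_gb(params_billion, bits_per_weight, overhead=1.10):
    # Rough file-size estimate: weights at the quant scheme's effective
    # bit width, padded ~10% for embeddings, scales, and metadata.
    total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return total_bytes / 1024**3

print(f'1.5B @ INT4:   ~{quantized_size_gb(1.5, 4.0):.1f} GB')  # ~0.8 GB
print(f'7B   @ Q4_K_M: ~{quantized_size_gb(7, 4.8):.1f} GB')    # ~4.3 GB
print(f'14B  @ Q4_K_M: ~{quantized_size_gb(14, 4.8):.1f} GB')   # ~8.6 GB
</code></pre>
<p>The 7B estimate lands within ~0.1 GB of the actual 4.4 GB GGUF file, which is close enough for budgeting.</p>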
<h2 id="why-these-specific-models">Why these specific models?</h2>
<h3 id="qwen25-coder-15b-for-completion">Qwen2.5-Coder-1.5B for completion</h3>
<p>For inline code completion, latency matters more than capability. The model needs to respond in under a second to feel useful. Qwen2.5-Coder-1.5B hits the sweet spot:</p>
<ul>
<li>Outperforms StarCoder2-3B on code completion benchmarks despite being half the size</li>
<li>At INT4 quantization, the entire model fits in under 1 GB</li>
<li>Fast enough for real-time autocomplete on integrated GPU</li>
</ul>
<h3 id="deepseek-r1-distill-qwen-7b-for-reasoning">DeepSeek-R1-Distill-Qwen-7B for reasoning</h3>
<p>For debugging and IT problem-solving, I wanted chain-of-thought reasoning — the model should show its work, not just give an answer. DeepSeek-R1&rsquo;s distilled variants inherit this behavior from the full R1 model.</p>
<p>The 7B variant on the Qwen 2.5 base is the largest that fits comfortably in the memory budget. At Q4_K_M quantization (4.4 GB file), it leaves room for a 4096-token context window.</p>
<p>Observed throughput: <strong>~16.5 tokens/sec</strong> on the Arc integrated GPU. Not fast, but usable for interactive chat where you&rsquo;re reading the response as it generates.</p>
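<p>For context on why ~16.5 tokens/sec is enough: assuming a rough 0.75 words per token (a commonly cited English average, not a measured figure), generation runs well ahead of typical reading speed:</p>
<pre tabindex="0"><code class="language-python" data-lang="python">tokens_per_sec = 16.5
words_per_token = 0.75          # assumed English average
words_per_min = tokens_per_sec * words_per_token * 60
print(int(words_per_min))       # ~740 words/min vs. ~250 wpm reading speed
</code></pre>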
<h2 id="whats-next">What&rsquo;s next</h2>
<p>In Part 2, I cover the model conversion pipeline, the two inference engines (llama.cpp SYCL and OpenVINO GenAI), writing a custom API server, and the pitfalls I hit along the way — including a Python version incompatibility, a dependency matrix from hell, and the NPU that almost worked.</p>
<p>In Part 3, I wire everything into VS Code and Open WebUI, create systemd services for lifecycle management, and share the operational lessons learned.</p>
<hr>
<p><em>Series: Running Local AI on Intel Lunar Lake</em></p>
<ul>
<li><strong>Part 1: Hardware, Drivers, and the Intel Compute Stack</strong> (you are here)</li>
<li>Part 2: Models, Inference Engines, and the NPU That Almost Worked <em>(coming soon)</em></li>
<li>Part 3: VS Code, Open WebUI, and Running It All as Services <em>(coming soon)</em></li>
<li>Part 4: Consolidating on Ollama with IPEX-LLM <em>(coming soon)</em></li>
</ul>
]]></content>
        </item>
        
    </channel>
</rss>
