
My First Kernel Module: A Technical Deep Dive into Building an Intel GNA Driver

Andrew Smalley

22 Feb 2025

Introduction

Entering kernel development is an adventure in both creativity and technical rigour. In a past project, I set out to build a Linux kernel module to support Intel's Gaussian & Neural Accelerator (GNA), the dedicated AI accelerator present on the i9‑13900K. I scoured old email archives and patch threads (from mailing lists like vtiger and LKML) and gathered pieces of code that dealt with PCI probing, memory management, interrupt handling, and even a basic IOCTL interface. The result is an out‑of‑tree module (available at akadata/linux-kernel-gna-patch) that demonstrates how to integrate and drive new hardware before it is accepted upstream.

In this post, I'll share representative patch excerpts, explain how each section works, and dive into the technical details behind the driver.

1. Overview of the Patch Series

The final driver comprises several key components:

Device Probing & Resource Allocation: Detecting the GNA device on the PCI bus and mapping its I/O memory.
Interrupt Handling: Registering an interrupt handler that responds to hardware events.
Memory Management: Allocating DMA buffers, setting up the MMU for the device, and handling user‑space memory mapping.
User‑Space Interface (IOCTLs): Exposing a minimal API for applications to communicate with the driver.

Below, I've included excerpts from each area and an explanation of how the patch works.

2. Device Probing and Resource Allocation

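When the driver is loaded, its first task is to detect the GNA device on the PCI bus and allocate its resources. Here's an excerpt from the probe function, which receives the parent device, the already-mapped I/O base (iobase) and the IRQ number as arguments: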

static int gna_probe(struct device *parent, struct gna_dev_info *dev_info, 
                     void __iomem *iobase, int irq)
{
    static atomic_t dev_last_idx = ATOMIC_INIT(-1);
    struct gna_private *gna_priv;
    const char *dev_misc_name;
    int ret;

    gna_priv = devm_kzalloc(parent, sizeof(*gna_priv), GFP_KERNEL);
    if (!gna_priv)
        return -ENOMEM;

    /* Allocate a unique index for this instance */
    gna_priv->index = atomic_inc_return(&dev_last_idx);
    gna_priv->recovery_timeout_jiffies = msecs_to_jiffies(recovery_timeout * 1000);
    gna_priv->iobase = iobase;
    gna_priv->info = *dev_info;

    /* Allocate and assign a device name (e.g., "intel_gna0") */
    dev_misc_name = devm_kasprintf(parent, GFP_KERNEL, "%s%d", GNA_DV_NAME, gna_priv->index);
    if (!dev_misc_name)
        return -ENOMEM;
    gna_priv->misc.name = dev_misc_name;

    /* DMA mask setup – essential for mapping 64-bit addresses */
    if (!(sizeof(dma_addr_t) > 4) || dma_set_mask(parent, DMA_BIT_MASK(64))) {
        ret = dma_set_mask(parent, DMA_BIT_MASK(32));
        if (ret) {
            dev_err(parent, "dma_set_mask error: %d\n", ret);
            return ret;
        }
    }

    /* The remainder of this function registers the device as a misc device */
    gna_priv->misc.minor = MISC_DYNAMIC_MINOR;
    gna_priv->misc.fops = &gna_file_ops;
    gna_priv->misc.mode = 0666;

    ret = gna_devm_register_misc_dev(parent, &gna_priv->misc);
    if (ret)
        return ret;

    dev_set_drvdata(parent, gna_priv);
    return 0;
}

Explanation

Resource Allocation: The probe function uses devm_kzalloc(), so the private structure is freed automatically when the device is unbound. It assigns a unique index via an atomic counter and converts the recovery_timeout module parameter (given in seconds) into jiffies.
DMA Configuration: The code sets the DMA mask, which tells the kernel which bus addresses the device can reach. It tries a 64‑bit mask first (only when dma_addr_t is wide enough to hold one) and falls back to 32‑bit if that fails.
Misc Device Registration: Registering the device as a misc device creates a user‑accessible node (e.g. /dev/intel_gna0) that applications can open and communicate with via IOCTLs.
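The probe bookkeeping is easy to experiment with outside the kernel. The following is a userspace sketch, not driver code: C11 atomic_fetch_add() stands in for atomic_inc_return(), snprintf() for devm_kasprintf(), and a round-up helper for msecs_to_jiffies(). GNA_DV_NAME is assumed to expand to "intel_gna" (matching the /dev/intel_gna0 node above), HZ=250 is an assumed kernel configuration, and the gna_next_index(), gna_make_name() and gna_timeout_jiffies() helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Assumed value; the real GNA_DV_NAME comes from the driver's headers. */
#define GNA_DV_NAME "intel_gna"

/* Assumed kernel configuration; HZ is a build-time option in real kernels. */
#define HZ 250

/* Mirrors `static atomic_t dev_last_idx = ATOMIC_INIT(-1)` in gna_probe(). */
static atomic_int dev_last_idx = -1;

/* Emulates atomic_inc_return(): atomically add 1 and return the NEW value,
 * so the first caller gets 0, the second gets 1, and so on. */
static int gna_next_index(void)
{
    return atomic_fetch_add(&dev_last_idx, 1) + 1;
}

/* Builds the misc device name the same way the "%s%d" format does in probe. */
static int gna_make_name(char *buf, size_t len, int index)
{
    return snprintf(buf, len, "%s%d", GNA_DV_NAME, index);
}

/* Rough model of msecs_to_jiffies(recovery_timeout * 1000): rounds up so a
 * timeout never expires earlier than requested. */
static unsigned long gna_timeout_jiffies(unsigned int seconds)
{
    unsigned long ms = (unsigned long)seconds * 1000UL;

    return (ms * HZ + 999UL) / 1000UL;
}
```

Because the counter is static and only ever incremented, indices are not reused when a device is removed, so a re-probed device may come back as intel_gna1 rather than intel_gna0.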

3. Interrupt Handling

A key part of any hardware driver is managing interrupts. The driver registers an interrupt handler to respond when the GNA hardware completes its tasks:

static irqreturn_t gna_interrupt(int irq, void *priv)
{
    struct gna_private *gna_priv = priv; /* void * converts implicitly in C */

    /* Clear the busy flag and wake up any waiting processes */
    gna_priv->dev_busy = false;
    wake_up(&gna_priv->dev_busy_waitq);

    return IRQ_HANDLED;
}

Explanation

Interrupt Routine: When an interrupt occurs, the handler clears a flag (dev_busy) that indicates the device was busy processing a request. It then wakes up processes waiting on a wait queue (dev_busy_waitq).
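Return Value: The handler returns IRQ_HANDLED to tell the kernel that the interrupt came from this device and was serviced. A handler on a shared interrupt line would normally check the device's status first and return IRQ_NONE for interrupts it did not raise.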
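The handshake between the interrupt handler and a sleeping caller can be modelled in user space with a mutex and a condition variable. This is only an analogy, assuming the waiting side (presumably gna_ioctl_wait(), which is not shown here) sleeps until dev_busy becomes false. struct fake_gna and the fake_gna_* functions are hypothetical, and a thread plays the part of the hardware interrupt.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the relevant fields of gna_private. */
struct fake_gna {
    pthread_mutex_t lock;       /* protects dev_busy in this model */
    pthread_cond_t busy_waitq;  /* plays the role of dev_busy_waitq */
    bool dev_busy;
};

static void fake_gna_init(struct fake_gna *g)
{
    int rc = pthread_mutex_init(&g->lock, NULL);
    assert(rc == 0);            /* demo-only error handling */
    rc = pthread_cond_init(&g->busy_waitq, NULL);
    assert(rc == 0);
    (void)rc;
    g->dev_busy = false;
}

/* Submission path: mark the device busy before kicking the hardware. */
static void fake_gna_submit(struct fake_gna *g)
{
    pthread_mutex_lock(&g->lock);
    g->dev_busy = true;
    pthread_mutex_unlock(&g->lock);
}

/* Analogue of gna_interrupt(): clear the flag, then wake every waiter. */
static void *fake_gna_interrupt(void *arg)
{
    struct fake_gna *g = arg;

    pthread_mutex_lock(&g->lock);
    g->dev_busy = false;
    pthread_cond_broadcast(&g->busy_waitq);
    pthread_mutex_unlock(&g->lock);
    return NULL;
}

/* Analogue of waiting on dev_busy_waitq: re-check the condition after every
 * wakeup, so spurious or early wakeups are harmless. */
static void fake_gna_wait_idle(struct fake_gna *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->dev_busy)
        pthread_cond_wait(&g->busy_waitq, &g->lock);
    pthread_mutex_unlock(&g->lock);
}
```

The property that carries over to the kernel is that the waiter re-checks the flag after every wakeup rather than trusting the wakeup itself, which is what the wait_event() family of macros does internally.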

4. Memory Management and DMA Handling

Memory management is one of the more challenging aspects. The driver allocates coherent DMA buffers and sets up the device's internal MMU. An excerpt from the memory allocation patch:

int gna_mmu_alloc(struct gna_private *gna_priv)
{
    struct device *parent = gna_parent(gna_priv);
    struct gna_mmu_object *mmu;
    int desc_size, i;

    if (gna_priv->info.num_pagetables > GNA_PGDIRN_LEN) {
        dev_err(gna_dev(gna_priv), "too many pagetables requested\n");
        return -EINVAL;
    }

    mmu = &gna_priv->mmu;
    desc_size = round_up(gna_priv->info.desc_info.desc_size, PAGE_SIZE);

    mmu->hwdesc = dmam_alloc_coherent(parent, desc_size, &mmu->hwdesc_dma, GFP_KERNEL);
    if (!mmu->hwdesc)
        return -ENOMEM;

    mmu->num_pagetables = gna_priv->info.num_pagetables;
    mmu->pagetables_dma = devm_kmalloc_array(parent, mmu->num_pagetables, sizeof(*mmu->pagetables_dma), GFP_KERNEL);
    if (!mmu->pagetables_dma)
        return -ENOMEM;

    mmu->pagetables = devm_kmalloc_array(parent, mmu->num_pagetables, sizeof(*mmu->pagetables), GFP_KERNEL);
    if (!mmu->pagetables)
        return -ENOMEM;

    for (i = 0; i < mmu->num_pagetables; i++) {
        mmu->pagetables[i] = dmam_alloc_coherent(parent, PAGE_SIZE, &mmu->pagetables_dma[i], GFP_KERNEL);
        if (!mmu->pagetables[i])
            return -ENOMEM;
    }

    /* Initialises the page directory with DMA addresses from allocated pagetables */
    for (i = 0; i < mmu->num_pagetables; i++) {
        mmu->hwdesc->mmu.pagedir_n[i] = mmu->pagetables_dma[i] >> PAGE_SHIFT;
    }

    return 0;
}

Explanation

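Coherent Memory Allocation: The driver uses dmam_alloc_coherent() to allocate memory that the CPU and the device can both access without explicit cache synchronisation. The managed (dmam_) variant also frees the memory automatically when the driver detaches.
Page Table Setup: After allocating the individual page tables, the driver writes their DMA addresses, right-shifted by PAGE_SHIFT (that is, as page frame numbers), into the device's page directory so the hardware can locate each table. The earlier check against GNA_PGDIRN_LEN ensures the directory never overflows.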
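The page-directory arithmetic can be checked in isolation. This sketch assumes 4 KiB pages (PAGE_SHIFT = 12, as on x86-64), a hypothetical directory length FAKE_PGDIRN_LEN of 64, and 32-bit directory entries; the real sizes come from the driver's headers. It mirrors the bounds check and the >> PAGE_SHIFT conversion in gna_mmu_alloc(), with plain integers standing in for DMA addresses. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                    /* assumption: 4 KiB pages */
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define FAKE_PGDIRN_LEN 64               /* hypothetical directory length */

/* Same result as the kernel's round_up() for power-of-two alignments. */
static size_t round_up_pow2(size_t x, size_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Fills the directory with page frame numbers. Returns 0 on success or
 * -22 (-EINVAL) when too many tables are requested, like gna_mmu_alloc(). */
static int fill_pagedir(uint32_t *pagedir, const uint64_t *pagetables_dma, int n)
{
    if (n > FAKE_PGDIRN_LEN)
        return -22;

    for (int i = 0; i < n; i++) {
        /* Coherent PAGE_SIZE allocations are page-aligned, so no low bits are lost. */
        assert((pagetables_dma[i] & (PAGE_SIZE - 1)) == 0);
        pagedir[i] = (uint32_t)(pagetables_dma[i] >> PAGE_SHIFT);
    }
    return 0;
}
```

Storing frame numbers rather than byte addresses lets a 32-bit entry reach 2^44 bytes (16 TiB) of bus address space with 4 KiB pages, at the cost of requiring page-aligned tables, which dma_alloc_coherent() guarantees for a PAGE_SIZE allocation.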

5. User‑Space Interface and IOCTLs

An IOCTL interface is provided to allow user‑space applications to interact with the driver. This enables memory mapping, retrieving device parameters, and submitting computation requests. An excerpt from the IOCTL handler:

long gna_ioctl(struct file *f, unsigned int cmd, unsigned long arg)
{
    struct gna_file_private *file_priv = f->private_data;
    struct gna_private *gna_priv = file_priv->gna_priv;
    void __user *argptr = (void __user *)arg;
    int ret = 0;

    switch (cmd) {
    case GNA_GET_PARAMETER:
        ret = gna_getparam(gna_priv, (union gna_parameter __user *)argptr); /* keep the __user annotation for sparse */
        break;
    case GNA_MAP_MEMORY:
        ret = gna_ioctl_map(file_priv, argptr);
        break;
    case GNA_UNMAP_MEMORY:
        ret = gna_ioctl_free(file_priv, arg);
        break;
    case GNA_COMPUTE:
        ret = gna_ioctl_score(file_priv, argptr);
        break;
    case GNA_WAIT:
        ret = gna_ioctl_wait(f, argptr);
        break;
    default:
        ret = -EINVAL;
        break;
    }
    return ret;
}

Explanation

Command Handling: The IOCTL function checks the command (cmd) passed from user space and calls the appropriate handler function. This includes getting parameters, mapping/unmapping memory, submitting a computation request, and waiting for completion.
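Error Handling: The excerpt omits the driver's debug logging, but the structure still rejects unknown commands with an error code (-EINVAL). Note that the kernel's ioctl guidelines recommend -ENOTTY for unrecognised commands, which user space sees as "Inappropriate ioctl for device".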
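The dispatch pattern, including how ioctl command numbers are built and validated, can be sketched in user space. Everything named fake_* here is hypothetical: the magic number, command number, parameter struct and returned value are placeholders, since the real definitions live in the driver's uapi header. The sketch uses the standard _IOWR(), _IO() and _IOC_TYPE() macros from <sys/ioctl.h> and returns -ENOTTY for commands it does not recognise.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical magic number; the real value is in the driver's uapi header. */
#define FAKE_GNA_MAGIC 'C'

/* Hypothetical parameter layout. */
struct fake_gna_param {
    uint64_t id;     /* which parameter to read */
    uint64_t value;  /* filled in by the driver */
};

#define FAKE_GNA_PARAM_VERSION 0   /* hypothetical parameter id */
#define FAKE_GNA_GET_PARAMETER _IOWR(FAKE_GNA_MAGIC, 0x01, struct fake_gna_param)

/* Userspace model of the dispatcher. In the kernel, arg is a __user pointer
 * and must be accessed with copy_from_user()/copy_to_user(); here it is an
 * ordinary pointer. */
static long fake_gna_ioctl(unsigned int cmd, void *arg)
{
    /* Reject commands from another driver's ioctl namespace. */
    if (_IOC_TYPE(cmd) != FAKE_GNA_MAGIC)
        return -ENOTTY;

    switch (cmd) {
    case FAKE_GNA_GET_PARAMETER: {
        struct fake_gna_param *p = arg;

        if (p->id != FAKE_GNA_PARAM_VERSION)
            return -EINVAL;   /* known command, bad argument */
        p->value = 1;         /* hypothetical version number */
        return 0;
    }
    default:
        return -ENOTTY;       /* unknown command */
    }
}
```

In this sketch -EINVAL is reserved for a recognised command with a bad argument, while -ENOTTY signals that the command itself is unknown, so user space can tell a driver-version mismatch apart from a malformed request.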

6. Conclusion

Building my first kernel module for Intel GNA was a challenging yet enlightening experience. By collecting code fragments from archived patches, I was able to create a working driver that:

Probes and Allocates Resources: Detects the GNA device and maps its I/O memory.
Handles Interrupts: Registers an interrupt handler to manage hardware events.
Manages Memory for DMA: Allocates coherent memory, sets page tables, and configures the device's MMU.
Exposes a User‑Space API: Implements an IOCTL interface to allow user applications to control the hardware.

Each patch component played a distinct role in the finished driver. Although my module is not yet part of the mainline Linux kernel, it serves as a proof of concept and a learning tool for those interested in kernel development and hardware integration.

I invite fellow developers to review the patches, experiment with the code, and contribute improvements. Open-source development thrives on collaboration and knowledge sharing, and this project is just one step toward broader support for modern AI accelerators in Linux.

Happy hacking, and enjoy exploring the depths of kernel programming!
