Using Huge Pages in Linux for Big Data Processing

Huge pages can significantly improve performance for big data processing: each TLB (Translation Lookaside Buffer) entry maps a much larger region (typically 2MB instead of 4KB on x86_64), which reduces TLB misses and page-table management overhead. Here's how to use them in Linux with C/C++ examples.

1. Configuring Huge Pages in Linux

First, configure huge pages on your system:

# Check current huge page settings
cat /proc/meminfo | grep Huge

# Set number of huge pages (e.g., 1024 pages of 2MB each = 2GB)
sudo sysctl vm.nr_hugepages=1024

# Make it persistent by adding to /etc/sysctl.conf
echo "vm.nr_hugepages=1024" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
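
To choose a value for vm.nr_hugepages, divide the size of the data set by the huge page size and round up. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical, and the 2MB page size is the assumption used in the command above.

```c
#include <stddef.h>

/* Number of huge pages needed to hold `bytes`, rounded up to whole pages.
 * `page_size` is the huge page size in bytes (2MB is assumed in this article).
 * Returns 0 if `page_size` is 0, to avoid dividing by zero. */
size_t huge_pages_needed(size_t bytes, size_t page_size) {
    if (page_size == 0) {
        return 0;
    }
    return bytes / page_size + (bytes % page_size != 0);
}
```

For example, huge_pages_needed(1024UL * 1024 * 1024, 2UL * 1024 * 1024) gives 512, so the 1024 pages reserved above leave room for a 1GB buffer plus headroom.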

2. C/C++ Example Using Huge Pages

Here's a complete example demonstrating huge page allocation:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)   // 2MB, the default on x86_64
#define ARRAY_SIZE (1024UL * 1024 * 1024)    // 1GB array (a multiple of HUGE_PAGE_SIZE)

// Method 1: Using mmap with MAP_HUGETLB flag
void* allocate_huge_pages_mmap(size_t size) {
    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                    -1, 0);
    
    if (ptr == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    
    printf("Allocated %zu bytes using mmap+MAP_HUGETLB at %p\n", size, ptr);
    return ptr;
}

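// Method 2: Using hugetlbfs filesystem
// Assumes hugetlbfs is mounted at /dev/hugepages and is writable by this user.
void* allocate_huge_pages_hugetlbfs(size_t size) {
    char path[] = "/dev/hugepages/hugepagefile";
    int fd = open(path, O_CREAT | O_RDWR, 0600);

    if (fd < 0) {
        perror("open");
        return NULL;
    }

    // Unlink right away so the file and its huge pages are released
    // automatically once the mapping is gone, even if the program crashes.
    unlink(path);

    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return NULL;
    }

    printf("Allocated %zu bytes using hugetlbfs at %p\n", size, ptr);
    // The mapping stays valid after the descriptor is closed.
    close(fd);
    return ptr;
}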

void process_large_data(double* data, size_t size) {
    // Simulate big data processing
    for (size_t i = 0; i < size / sizeof(double); i++) {
        data[i] = i * 0.5;
    }
    
    // Do some computation
    double sum = 0;
    for (size_t i = 0; i < size / sizeof(double); i++) {
        sum += data[i];
    }
    
    printf("Processing completed. Sum: %f\n", sum);
}

int main(void) {
    // Memory from malloc must be released with free(), not munmap()
    int used_malloc = 0;

    // Allocate memory using huge pages
    double* huge_data = (double*)allocate_huge_pages_mmap(ARRAY_SIZE);
    if (!huge_data) {
        fprintf(stderr, "Failed to allocate using mmap+MAP_HUGETLB. Trying hugetlbfs...\n");
        huge_data = (double*)allocate_huge_pages_hugetlbfs(ARRAY_SIZE);
        if (!huge_data) {
            fprintf(stderr, "Failed to allocate huge pages. Falling back to regular pages.\n");
            huge_data = (double*)malloc(ARRAY_SIZE);
            if (!huge_data) {
                perror("malloc");
                return 1;
            }
            used_malloc = 1;
        }
    }

    // Process data
    process_large_data(huge_data, ARRAY_SIZE);

    // Free memory with the call that matches the allocation method
    if (used_malloc) {
        free(huge_data);
    } else if (munmap(huge_data, ARRAY_SIZE) != 0) {
        perror("munmap");
    }

    return 0;
}
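
Method 1 maps pages of the system's default huge page size. To request a specific size, such as 1GB pages, mmap accepts the base-2 logarithm of the page size encoded in the flag bits at offset MAP_HUGE_SHIFT. The sketch below shows that encoding; both function names are hypothetical, and the fallback value 26 for MAP_HUGE_SHIFT is an assumption for headers that do not define it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes MAP_HUGETLB and MAP_ANONYMOUS on glibc */
#endif
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26  /* assumed Linux value if the headers omit it */
#endif

/* Encode a power-of-two huge page size as mmap flag bits:
 * log2(page_size) shifted left by MAP_HUGE_SHIFT.
 * Returns -1 if page_size is not a power of two, or is too large
 * for the result to fit in a positive int. */
int huge_page_size_flag(size_t page_size) {
    if (page_size == 0 || (page_size & (page_size - 1)) != 0) {
        return -1;
    }
    int log2_size = 0;
    while ((page_size >> log2_size) != 1) {
        log2_size++;
    }
    if (log2_size > 31) {
        return -1;
    }
    return log2_size << MAP_HUGE_SHIFT;
}

/* Map `size` bytes backed by huge pages of exactly `page_size` bytes.
 * Returns NULL if the size is invalid, unsupported, or its pool is empty. */
void* allocate_sized_huge_pages(size_t size, size_t page_size) {
    int size_flag = huge_page_size_flag(page_size);
    if (size_flag < 0) {
        return NULL;
    }
    void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | size_flag,
                     -1, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return ptr;
}
```

A call such as allocate_sized_huge_pages(1UL << 30, 1UL << 30) only succeeds if 1GB pages have been reserved separately; the vm.nr_hugepages setting shown earlier reserves pages of the default size only.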

3. Compiling and Running

Compile the program with:

gcc -o hugepage_demo hugepage_demo.c

Run it with:

./hugepage_demo

4. Verifying Huge Page Usage

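Check huge page usage from a second terminal while your program is running. Once it exits, its huge pages return to the pool and HugePages_Free goes back to its previous value: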

cat /proc/meminfo | grep Huge
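
The same counters can be read from inside a program, for example to log pool usage before and after an allocation. This is a minimal sketch: the struct and function names are hypothetical, and it assumes the usual /proc/meminfo field names HugePages_Total, HugePages_Free, HugePages_Rsvd, and Hugepagesize.

```c
#include <stdio.h>

/* Huge page counters as reported by /proc/meminfo. */
struct hugepage_info {
    long total;         /* HugePages_Total */
    long free_pages;    /* HugePages_Free */
    long reserved;      /* HugePages_Rsvd */
    long page_size_kb;  /* Hugepagesize, in kB */
};

/* Update `out` from one line of /proc/meminfo text.
 * Returns 1 if the line held one of the fields above, 0 otherwise. */
int parse_hugepage_line(const char* line, struct hugepage_info* out) {
    long value;
    if (sscanf(line, "HugePages_Total: %ld", &value) == 1) {
        out->total = value;
        return 1;
    }
    if (sscanf(line, "HugePages_Free: %ld", &value) == 1) {
        out->free_pages = value;
        return 1;
    }
    if (sscanf(line, "HugePages_Rsvd: %ld", &value) == 1) {
        out->reserved = value;
        return 1;
    }
    if (sscanf(line, "Hugepagesize: %ld kB", &value) == 1) {
        out->page_size_kb = value;
        return 1;
    }
    return 0;
}

/* Read all counters from /proc/meminfo. Fields that are missing stay at -1.
 * Returns the number of fields found, or -1 if the file cannot be opened. */
int read_hugepage_info(struct hugepage_info* out) {
    FILE* fp = fopen("/proc/meminfo", "r");
    if (fp == NULL) {
        return -1;
    }
    char line[256];
    int found = 0;
    out->total = out->free_pages = out->reserved = out->page_size_kb = -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        found += parse_hugepage_line(line, out);
    }
    fclose(fp);
    return found;
}
```

Pages in use are total minus free_pages. Pages that have been mapped but not yet touched show up in reserved rather than as used.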

5. Important Notes

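  1. Permissions: Reserving pages through vm.nr_hugepages requires root. The hugetlbfs method also needs write access to the mount point (/dev/hugepages in the example), which may be restricted depending on how it is mounted.

  2. Page Size: On x86_64 the default huge page size is 2MB (see Hugepagesize in /proc/meminfo). 1GB pages are also available on CPUs that support them.

  3. Allocation: Each huge page must be contiguous in physical memory, so on a long-running, fragmented system the kernel may reserve fewer pages than requested. Reserve pages early, ideally at boot, and confirm the count with HugePages_Total.

  4. Transparent Huge Pages (THP): Linux also supports THP, which lets the kernel back ordinary anonymous memory with huge pages automatically, without an explicit reservation or code changes. Enable it system-wide with: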

    echo "always" | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
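
THP can also be requested per region instead of system-wide: when the setting above is "madvise" rather than "always", only ranges marked with madvise(MADV_HUGEPAGE) are eligible. The sketch below marks a 2MB-aligned buffer this way. The function name and the THP_ALIGNMENT constant are hypothetical, and the 2MB alignment assumes the x86_64 default huge page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes madvise and MADV_HUGEPAGE on glibc */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define THP_ALIGNMENT (2UL * 1024 * 1024)  /* assumed 2MB huge page size */

/* Allocate `size` bytes aligned to 2MB and hint that the kernel may back
 * the range with transparent huge pages. The hint is advisory: if it
 * fails, the memory is still usable with regular pages.
 * Release the buffer with free(). */
void* allocate_thp_hinted(size_t size) {
    void* ptr = NULL;
    int err = posix_memalign(&ptr, THP_ALIGNMENT, size);
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return NULL;
    }
#ifdef MADV_HUGEPAGE
    if (madvise(ptr, size, MADV_HUGEPAGE) != 0) {
        perror("madvise(MADV_HUGEPAGE)");
    }
#endif
    return ptr;
}
```

Unlike MAP_HUGETLB, this approach needs no reserved pool, but the kernel gives no guarantee that huge pages will actually be used. AnonHugePages in /proc/meminfo shows how much memory is currently backed by THP.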

6. When to Use Huge Pages

Huge pages are particularly beneficial for:

  • Large in-memory databases
  • Scientific computing applications
  • Big data processing frameworks
  • Any memory-intensive application processing large datasets

The performance improvement comes from reduced TLB pressure and fewer page faults when working with large datasets.
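
The effect on TLB pressure can be estimated as "TLB reach": the amount of memory the TLB can map at once, which is the number of entries times the page size. The function below is a small illustration of that arithmetic; its name is hypothetical, and any entry count passed to it is an assumption, since real CPUs often have different numbers of entries for different page sizes.

```c
#include <stddef.h>

/* TLB reach: memory that can be accessed without a TLB miss,
 * i.e. the number of TLB entries times the bytes each entry maps. */
size_t tlb_reach_bytes(size_t tlb_entries, size_t page_size) {
    return tlb_entries * page_size;
}
```

With an assumed 1024 entries, tlb_reach_bytes(1024, 4096) is 4MB, while tlb_reach_bytes(1024, 2UL * 1024 * 1024) is 2GB. Likewise, the 1GB array from the example spans 262,144 pages of 4KB but only 512 pages of 2MB, so first-touch page faults drop by the same factor.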

