Docker--Apache/hadoop

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Quickstart

A Hadoop cluster can be created by pulling in the relevant docker image and specifying the required configurations.

Example building the latest hadoop-3 image
Create a basic docker-compose.yaml file like:
复制代码
version: "2"
services:
   namenode:
      image: apache/hadoop:3
      hostname: namenode
      command: ["hdfs", "namenode"]
      ports:
        - 9870:9870
      env_file:
        - ./config
      environment:
          ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
   datanode:
      image: apache/hadoop:3
      command: ["hdfs", "datanode"]
      env_file:
        - ./config      
   resourcemanager:
      image: apache/hadoop:3
      hostname: resourcemanager
      command: ["yarn", "resourcemanager"]
      ports:
         - 8088:8088
      env_file:
        - ./config
      volumes:
        - ./test.sh:/opt/test.sh
   nodemanager:
      image: apache/hadoop:3
      command: ["yarn", "nodemanager"]
      env_file:
        - ./config

Change the image: apache/hadoop:3 incase you want to build any other image like image: apache/hadoop:3.3.5 for building Apache Hadoop 3.3.5 image

Create a config file like:
复制代码
CORE-SITE.XML_fs.default.name=hdfs://namenode
CORE-SITE.XML_fs.defaultFS=hdfs://namenode
HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:8020
HDFS-SITE.XML_dfs.replication=1
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false

** You can add/replace any new config in the similar format in this file.

Check the current directory (optional)

Do a ls -l on the current directory it should have the two files we created above

复制代码
docker-3 % ls -l
-rw-r--r--  1 hadoop  apache  2547 Jun 23 15:53 config
-rw-r--r--  1 hadoop  apache  1533 Jun 23 16:07 docker-compose.yaml
Run the docker containers

Run the docker containers using docker-compose

复制代码
docker-compose up -d

The output should look like:

复制代码
docker-3 % docker-compose up -d    
Creating network "docker-3_default" with the default driver
Creating docker-3_namenode_1        ... done
Creating docker-3_datanode_1        ... done
Creating docker-3_nodemanager_1     ... done
Creating docker-3_resourcemanager_1 ... done
Accessing the Cluster:
Login into a node:

Can login into any node by specifying the container like:

复制代码
docker exec -it docker-3_namenode_1 /bin/bash
Running an example Job (Pi Job)
复制代码
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar pi 10 15

The above will run a Pi Job and similarly any hadoop related command can be run.

Accessing the UI

The Namenode UI can be accessed at http://localhost:9870/⁠ and the ResourceManager UI can be accessed at http://localhost:8088/⁠

Shutdown Cluster

The cluster can be shut down via:

复制代码
docker-compose down
Note:

The above example is for Hadoop-3.x line, In case you want to build the Hadoop-2.x, Similar steps but different config & docker-compose.yaml file. Logic can be extracted from: https://github.com/apache/hadoop/tree/docker-hadoop-2⁠

Docker Source Code:

The docker images are built via special branches & the source code for branch 3 lies at https://github.com/apache/hadoop/tree/docker-hadoop-3⁠ and for branch 2 at https://github.com/apache/hadoop/tree/docker-hadoop-2⁠

Reaching out us:

Hadoop Developers can be reached via the hadoop mailing lists: https://hadoop.apache.org/mailing_lists.html⁠

Further Reading

https://hadoop.apache.org/⁠

相关推荐
小Pawn爷4 小时前
4.镜像仓库
docker
江湖有缘6 小时前
零基础入门:使用 Docker 快速部署 Organizr 个人主页
java·服务器·docker
礼拜天没时间.8 小时前
深入Docker架构——C/S模式解析
linux·docker·容器·架构·centos
猫头虎8 小时前
如何使用Docker部署OpenClaw汉化中文版?
运维·人工智能·docker·容器·langchain·开源·aigc
会周易的程序员8 小时前
openplc runtimev4 Docker 部署
运维·c++·物联网·docker·容器·软件工程·iot
无级程序员8 小时前
大数据Hive之拉链表增量取数合并设计(主表加历史表合并成拉链表)
大数据·hive·hadoop
小Pawn爷8 小时前
1.Docker基础
运维·docker·容器
chinesegf8 小时前
清理docker残留镜像images
运维·docker·容器
小Pawn爷9 小时前
2.Docker的存储
运维·docker·容器
礼拜天没时间.10 小时前
自定义镜像制作——从Dockerfile到镜像
linux·docker·容器·centos·bash