Docker--Apache/hadoop

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Quickstart

A Hadoop cluster can be created by pulling in the relevant docker image and specifying the required configurations.

Example building the latest hadoop-3 image
Create a basic docker-compose.yaml file like:
复制代码
version: "2"
services:
   namenode:
      image: apache/hadoop:3
      hostname: namenode
      command: ["hdfs", "namenode"]
      ports:
        - 9870:9870
      env_file:
        - ./config
      environment:
          ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
   datanode:
      image: apache/hadoop:3
      command: ["hdfs", "datanode"]
      env_file:
        - ./config      
   resourcemanager:
      image: apache/hadoop:3
      hostname: resourcemanager
      command: ["yarn", "resourcemanager"]
      ports:
         - 8088:8088
      env_file:
        - ./config
      volumes:
        - ./test.sh:/opt/test.sh
   nodemanager:
      image: apache/hadoop:3
      command: ["yarn", "nodemanager"]
      env_file:
        - ./config

Change the image: apache/hadoop:3 incase you want to build any other image like image: apache/hadoop:3.3.5 for building Apache Hadoop 3.3.5 image

Create a config file like:
复制代码
CORE-SITE.XML_fs.default.name=hdfs://namenode
CORE-SITE.XML_fs.defaultFS=hdfs://namenode
HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:8020
HDFS-SITE.XML_dfs.replication=1
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false

** You can add/replace any new config in the similar format in this file.

Check the current directory (optional)

Do a ls -l on the current directory it should have the two files we created above

复制代码
docker-3 % ls -l
-rw-r--r--  1 hadoop  apache  2547 Jun 23 15:53 config
-rw-r--r--  1 hadoop  apache  1533 Jun 23 16:07 docker-compose.yaml
Run the docker containers

Run the docker containers using docker-compose

复制代码
docker-compose up -d

The output should look like:

复制代码
docker-3 % docker-compose up -d    
Creating network "docker-3_default" with the default driver
Creating docker-3_namenode_1        ... done
Creating docker-3_datanode_1        ... done
Creating docker-3_nodemanager_1     ... done
Creating docker-3_resourcemanager_1 ... done
Accessing the Cluster:
Login into a node:

Can login into any node by specifying the container like:

复制代码
docker exec -it docker-3_namenode_1 /bin/bash
Running an example Job (Pi Job)
复制代码
yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.5.jar pi 10 15

The above will run a Pi Job and similarly any hadoop related command can be run.

Accessing the UI

The Namenode UI can be accessed at http://localhost:9870/⁠ and the ResourceManager UI can be accessed at http://localhost:8088/⁠

Shutdown Cluster

The cluster can be shut down via:

复制代码
docker-compose down
Note:

The above example is for Hadoop-3.x line, In case you want to build the Hadoop-2.x, Similar steps but different config & docker-compose.yaml file. Logic can be extracted from: https://github.com/apache/hadoop/tree/docker-hadoop-2⁠

Docker Source Code:

The docker images are built via special branches & the source code for branch 3 lies at https://github.com/apache/hadoop/tree/docker-hadoop-3⁠ and for branch 2 at https://github.com/apache/hadoop/tree/docker-hadoop-2⁠

Reaching out us:

Hadoop Developers can be reached via the hadoop mailing lists: https://hadoop.apache.org/mailing_lists.html⁠

Further Reading

https://hadoop.apache.org/⁠

相关推荐
Thexhy1 小时前
CentOS快速安装DockerCE指南
linux·docker
二流子学程序7 小时前
Windows创建一个Docker镜像
docker·容器
G***T69111 小时前
Docker数据分析实战
docker·容器·数据分析
悟能不能悟11 小时前
docker怎么运行jar包
docker·容器·jar
8***B12 小时前
Docker虚拟现实案例
docker·容器·vr
q***428212 小时前
Redis 设置密码(配置文件、docker容器、命令行3种场景)
数据库·redis·docker
J***Q29214 小时前
Docker镜像多平台构建
运维·docker·容器
c***979814 小时前
Docker音频处理案例
运维·docker·容器
永远的超音速15 小时前
Docker入门
docker·电脑