Hadoop 2.10 Cluster Setup
Hadoop official documentation:
https://hadoop.apache.org/docs/
Installing the Hadoop cluster
Configure DNS resolution or the hosts file:
cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.3.149.20 hadoop-master
10.3.149.21 hadoop-node1
10.3.149.22 hadoop-node2
EOF
Set up passwordless SSH for the root user:
ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node2
ssh root@hadoop-master 'date'
ssh root@hadoop-node1 'date'
ssh root@hadoop-node2 'date'
Set up passwordless SSH for the hadoop user:
useradd hadoop
echo '123456' | passwd --stdin hadoop
su - hadoop
ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node2
ssh hadoop@hadoop-master 'date'
ssh hadoop@hadoop-node1 'date'
ssh hadoop@hadoop-node2 'date'
exit
Install Java:
tar -xf jdk-8u231-linux-x64.tar.gz -C /usr/local/
Create a symlink:
cd /usr/local/
ln -sv jdk1.8.0_231/ jdk
Add the environment variables:
cat > /etc/profile.d/java.sh <<EOF
export JAVA_HOME=/usr/local/jdk
export JRE_HOME=\$JAVA_HOME/jre
export CLASSPATH=.:\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar:\$JRE_HOME/lib
export PATH=\$PATH:\$JAVA_HOME/bin:\$JRE_HOME/bin
EOF
. /etc/profile.d/java.sh
Verify the installation:
java -version
javac -version
Install Hadoop:
Hadoop download mirrors:
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/
http://archive.apache.org/dist/hadoop/common/
Hadoop 2.7 release:
http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
Download the package:
wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
Extract it:
tar -xf hadoop-2.10.0.tar.gz -C /usr/local/
cd /usr/local/
ln -sv hadoop-2.10.0/ hadoop
Configure the environment variables:
cat > /etc/profile.d/hadoop.sh <<EOF
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
Apply the environment variables:
. /etc/profile.d/hadoop.sh
Create the data directories:
# master
mkdir -pv /data/hadoop/hdfs/{nn,snn}
# node
mkdir -pv /data/hadoop/hdfs/dn
Configuration on the master node:
Enter the configuration directory:
cd /usr/local/hadoop/etc/hadoop
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-master:8020</value>
    <final>true</final>
  </property>
</configuration>
yarn-site.xml
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop-master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop-master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop-master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop-master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop-master:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  </property>
</configuration>
hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/hadoop/hdfs/dn</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>file:///data/hadoop/hdfs/snn</value>
  </property>
</configuration>
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
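Note: if mapred-site.xml does not exist yet (the 2.10 tarball normally ships only mapred-site.xml.template), create it from the template first, for example:
cd /usr/local/hadoop/etc/hadoop
cp mapred-site.xml.template mapred-site.xml   # then add the property above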
Create the master file:
cat > master <<EOF
hadoop-master
EOF
Create the slaves file:
cat > slaves <<EOF
hadoop-node1
hadoop-node2
EOF
Notes on common configuration options:
http://blog.51yip.com/hadoop/2020.html
On the worker nodes:
Simply copy the configuration from the master node to the worker nodes:
scp ./* root@hadoop-node1:/usr/local/hadoop/etc/hadoop/
scp ./* root@hadoop-node2:/usr/local/hadoop/etc/hadoop/
Delete the slaves file; everything else is configured the same as on the master.
rm /usr/local/hadoop/etc/hadoop/slaves -rf
Create the log directory:
mkdir /usr/local/hadoop/logs
chmod g+w /usr/local/hadoop/logs/
Change the owner and group:
chown -R hadoop.hadoop /data/hadoop/
cd /usr/local/
chown -R hadoop.hadoop hadoop hadoop/
Starting and stopping the cluster
Format HDFS: once it has been formatted, the cluster can be started.
su - hadoop
[hadoop@hadoop-master ~]$ hadoop namenode -format
Start HDFS first: the output below shows each node and the daemons it starts.
[hadoop@hadoop-master ~]$ start-dfs.sh
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-node2: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node2.out
hadoop-node1: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out
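The secondary namenode is listed as 0.0.0.0 because dfs.namenode.secondary.http-address is left at its default; if you want it bound explicitly to the master, a property along these lines can be added to hdfs-site.xml (optional sketch):
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop-master:50090</value>
</property>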
Check the processes running on the current node; the following command can be run on any node.
~]$ jps
1174 Jps
32632 ResourceManager
32012 NameNode
32220 SecondaryNameNode
Then start YARN: you can see which daemons are started on which nodes.
[hadoop@hadoop-master ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-node2: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node2.out
hadoop-node1: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node1.out
Or start everything at once:
[hadoop@hadoop-master ~]$ start-all.sh
Check the running state of the Hadoop cluster:
hadoop dfsadmin -report
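In Hadoop 2.x the hadoop dfsadmin entry point is deprecated in favor of the hdfs command; the equivalent form gives the same report:
hdfs dfsadmin -report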
HDFS overview web UI:
http://10.3.149.20:50070/
Cluster information web UI:
http://10.3.149.20:8088/cluster
Stop the cluster:
stop-dfs.sh
stop-yarn.sh
Or:
stop-all.sh
Using the HDFS file system
List a directory:
~]$ hdfs dfs -ls /
Create a directory:
~]$ hdfs dfs -mkdir /test
Upload a file:
~]$ hdfs dfs -put /etc/fstab /test/fstab
Check where the file is stored: go to the data directory on one of the datanodes and you will find the file's block. The default block size is 128 MB; a file larger than that is split into multiple blocks, but a file smaller than 128 MB does not actually occupy a full 128 MB.
]$ cat /data/hadoop/hdfs/dn/current/BP-1469813358-10.3.149.20-1595493741225/current/finalized/subdir0/subdir0/blk_1073741825
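The block list and the datanodes holding each block can also be checked from the client side with fsck, for example (using the /test/fstab file uploaded above):
~]$ hdfs fsck /test/fstab -files -blocks -locations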
Recursive listing:
~]$ hdfs dfs -ls -R /
View a file:
~]$ hdfs dfs -cat /test/fstab
More help on the shell commands:
https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
Word-count example job:
The /usr/local/hadoop/share/hadoop/mapreduce directory contains many example computation jobs that can be used for testing.
First upload a file to use for the test:
hdfs dfs -mkdir /test
hdfs dfs -put /etc/fstab /test/fstab
View the help: running the jar with no arguments prints usage information.
yarn jar hadoop-mapreduce-examples-2.10.0.jar
Test: here the wordcount example is used.
cd /usr/local/hadoop/share/hadoop/mapreduce
]$ yarn jar hadoop-mapreduce-examples-2.10.0.jar wordcount /test/fstab /test/count
Running jobs can be seen on the following page:
http://10.3.149.20:8088/cluster/apps
View the result of the computation:
]$ hdfs dfs -cat /test/count/part-r-00000
Common yarn commands:
List running applications:
~]$ yarn application -list
List applications that have already run:
~]$ yarn application -list -appStates ALL
Check an application's status:
~]$ yarn application -status application_1595496103452_0001
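If a test application needs to be stopped, it can be killed by its ID, for example (reusing the application ID from above):
~]$ yarn application -kill application_1595496103452_0001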