Hadoop official documentation:

https://hadoop.apache.org/docs/


Installing a Hadoop Cluster


Configure DNS resolution or the hosts file:

cat > /etc/hosts <<EOF
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.3.149.20 hadoop-master
10.3.149.21 hadoop-node1
10.3.149.22 hadoop-node2
EOF
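
The same hosts file must be present on all three machines. A quick way to verify resolution (standard glibc tooling):

getent hosts hadoop-master hadoop-node1 hadoop-node2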

Configure passwordless SSH for the root user:

ssh-keygen 
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub root@hadoop-node2
ssh root@hadoop-master 'date'
ssh root@hadoop-node1 'date'
ssh root@hadoop-node2 'date'
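
With root SSH trust in place, the hosts file can be pushed from the master to the other nodes (a minimal sketch, assuming it was written on hadoop-master):

for h in hadoop-node1 hadoop-node2; do
    scp /etc/hosts root@$h:/etc/hosts
done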

Configure passwordless SSH for the hadoop user:

useradd hadoop
echo '123456' | passwd --stdin hadoop
su - hadoop

ssh-keygen
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-master
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node1
ssh-copy-id -i .ssh/id_rsa.pub hadoop@hadoop-node2
ssh hadoop@hadoop-master 'date'
ssh hadoop@hadoop-node1 'date'
ssh hadoop@hadoop-node2 'date'
exit
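
Note that the hadoop user must already exist on every node before the ssh-copy-id steps above can succeed. A sketch that creates it remotely over the root trust configured earlier (123456 is the example password from this walkthrough):

for h in hadoop-node1 hadoop-node2; do
    ssh root@$h "useradd hadoop && echo '123456' | passwd --stdin hadoop"
done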


Install Java:

tar -xf jdk-8u231-linux-x64.tar.gz -C /usr/local/

Create a symlink:

cd /usr/local/
ln -sv jdk1.8.0_231/ jdk

Add environment variables:

cat > /etc/profile.d/java.sh <<EOF
export JAVA_HOME=/usr/local/jdk
export JRE_HOME=\$JAVA_HOME/jre
export CLASSPATH=.:\$JAVA_HOME/lib/dt.jar:\$JAVA_HOME/lib/tools.jar:\$JRE_HOME/lib
export PATH=\$PATH:\$JAVA_HOME/bin:\$JRE_HOME/bin
EOF
. /etc/profile.d/java.sh

Verify the installation:

java -version
javac -version
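
Java is needed on every node. One way to replicate the JDK and the profile script from the master (a sketch over the root trust configured earlier):

for h in hadoop-node1 hadoop-node2; do
    scp -r /usr/local/jdk1.8.0_231 root@$h:/usr/local/
    ssh root@$h 'ln -sv /usr/local/jdk1.8.0_231 /usr/local/jdk'
    scp /etc/profile.d/java.sh root@$h:/etc/profile.d/
done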


Install Hadoop:

Hadoop download mirrors:

https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/ 
http://archive.apache.org/dist/hadoop/common/

For the 2.7 release:

http://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz

Download the package:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz

Extract it:

tar -xf hadoop-2.10.0.tar.gz -C /usr/local/
cd /usr/local/
ln -sv hadoop-2.10.0/ hadoop

Configure environment variables:

cat > /etc/profile.d/hadoop.sh <<EOF
export HADOOP_HOME=/usr/local/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF

Apply the environment variables:

. /etc/profile.d/hadoop.sh
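
The Hadoop tree and profile script are needed on the worker nodes as well; a sketch that pushes both from the master, then verifies the binaries are on PATH:

for h in hadoop-node1 hadoop-node2; do
    scp -r /usr/local/hadoop-2.10.0 root@$h:/usr/local/
    ssh root@$h 'ln -sv /usr/local/hadoop-2.10.0 /usr/local/hadoop'
    scp /etc/profile.d/hadoop.sh root@$h:/etc/profile.d/
done
hadoop version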

Create the data directories:

# master
mkdir -pv  /data/hadoop/hdfs/{nn,snn}
# node
mkdir -pv  /data/hadoop/hdfs/dn


Configuration on the master node:

Enter the configuration directory:

cd /usr/local/hadoop/etc/hadoop
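
Non-interactive SSH shells do not source /etc/profile.d, so the daemons may fail to start with a "JAVA_HOME is not set" error. A common fix is to pin JAVA_HOME directly in hadoop-env.sh (path per the JDK symlink created above):

sed -i 's#^export JAVA_HOME=.*#export JAVA_HOME=/usr/local/jdk#' hadoop-env.sh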

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:8020</value>
        <final>true</final>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop-master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop-master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop-master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hadoop-master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop-master:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
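
Once the files are in place, individual values can be sanity-checked with hdfs getconf, e.g.:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication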

Create the master file:

cat > master <<EOF
hadoop-master
EOF

Create the slaves file:

cat > slaves <<EOF
hadoop-node1
hadoop-node2
EOF


An annotated reference of common configuration options:

http://blog.51yip.com/hadoop/2020.html


On the node hosts:

Simply copy the configuration from the master to the nodes (run from /usr/local/hadoop/etc/hadoop on the master):

scp ./* root@hadoop-node1:/usr/local/hadoop/etc/hadoop/
scp ./* root@hadoop-node2:/usr/local/hadoop/etc/hadoop/

Then delete the slaves file on the nodes; everything else is the same as on the master.

rm -f /usr/local/hadoop/etc/hadoop/slaves


Create the log directory (on every node):

mkdir /usr/local/hadoop/logs
chmod g+w /usr/local/hadoop/logs/

Change the owner and group (again on every node; the trailing slash makes chown follow the symlink into the real directory):

chown -R hadoop:hadoop /data/hadoop/
chown -R hadoop:hadoop /usr/local/hadoop/


Starting and Stopping the Cluster


Format HDFS (run this once, on the master); once formatted, the cluster can be started:

su - hadoop
[hadoop@hadoop-master ~]$ hdfs namenode -format


Start HDFS first; from the output below you can see each node and the daemons it runs.

[hadoop@hadoop-master ~]$ start-dfs.sh 
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-namenode-hadoop-master.out
hadoop-node2: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node2.out
hadoop-node1: starting datanode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-datanode-hadoop-node1.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.10.0/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

Check the processes running on the current node; the following command can be used on any node.

~]$ jps
1174 Jps
32632 ResourceManager
32012 NameNode
32220 SecondaryNameNode


Then start YARN; you can see which daemons start on the corresponding nodes.

[hadoop@hadoop-master ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-resourcemanager-hadoop-master.out
hadoop-node2: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node2.out
hadoop-node1: starting nodemanager, logging to /usr/local/hadoop-2.10.0/logs/yarn-hadoop-nodemanager-hadoop-node1.out

Or start everything at once (start-all.sh is deprecated in Hadoop 2 but still works):

[hadoop@hadoop-master ~]$ start-all.sh


Check the running status of the Hadoop cluster:

hdfs dfsadmin -report

NameNode overview web UI:

http://10.3.149.20:50070/

Cluster information web UI:

http://10.3.149.20:8088/cluster


Stop the cluster:

stop-dfs.sh
stop-yarn.sh

Or:

stop-all.sh


Using the HDFS Filesystem


List a directory:

~]$ hdfs dfs -ls /

Create a directory:

~]$ hdfs dfs -mkdir /test

Upload a file:

~]$ hdfs dfs -put /etc/fstab /test/fstab

Locate the file on disk: the file's block can be found under the data directory on one of the datanodes. The default block size is 128 MB; a file larger than that is split into multiple blocks, while a file smaller than 128 MB does not actually occupy a full 128 MB.


]$ cat /data/hadoop/hdfs/dn/current/BP-1469813358-10.3.149.20-1595493741225/current/finalized/subdir0/subdir0/blk_1073741825
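
The block file above can be tied back to its HDFS path with hdfs fsck, which prints a file's block IDs and the datanodes holding each replica:

hdfs fsck /test/fstab -files -blocks -locations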

Browse recursively:

~]$ hdfs dfs -ls -R /

View a file:

~]$ hdfs dfs -cat /test/fstab

More help on the available commands:

https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/FileSystemShell.html


Word count example:

The /usr/local/hadoop/share/hadoop/mapreduce directory contains many example computation jobs that can be used for testing.

First upload the file used for testing:

hdfs dfs -mkdir -p /test
hdfs dfs -put /etc/fstab /test/fstab

View the help: running the jar without arguments prints usage information listing the available example programs.

yarn jar hadoop-mapreduce-examples-2.10.0.jar

Run a test; here the word count example is chosen.

cd /usr/local/hadoop/share/hadoop/mapreduce
]$ yarn jar hadoop-mapreduce-examples-2.10.0.jar wordcount /test/fstab /test/count
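
Note that the output directory (/test/count here) must not exist before the job is submitted, or the job will fail; remove it first when rerunning:

hdfs dfs -rm -r /test/count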

The running job can be seen on the following page:

http://10.3.149.20:8088/cluster/apps

View the result of the computation:

]$ hdfs dfs -cat /test/count/part-r-00000


Common yarn commands:

List running applications:

~]$ yarn application -list

List applications that have already run (all states):

~]$ yarn application -list -appStates ALL

Check the status of an application:

~]$ yarn application -status application_1595496103452_0001
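
To fetch the aggregated logs of a finished application (available once yarn.log-aggregation-enable is turned on; the ID is the one from above):

~]$ yarn logs -applicationId application_1595496103452_0001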