Installing Hadoop 3.0.0 in pseudo-distributed mode – VMware Player – Ubuntu 16.04

With Ubuntu installed, the next step is to install the packages required for the Hadoop environment.

If present, remove the CD-ROM entry from /etc/apt/sources.list:

root@hadoop02:~# vi /etc/apt/sources.list
Change from
[...]
#

# deb cdrom:[Ubuntu-Server 16.04.3 LTS _Xenial Xerus_ - Release amd64 (20170801)]/ xenial main restricted

deb cdrom:[Ubuntu-Server 16.04.3 LTS _Xenial Xerus_ - Release amd64 (20170801)]/ xenial main restricted
[...]
to
[...]
#

# deb cdrom:[Ubuntu-Server 16.04.3 LTS _Xenial Xerus_ - Release amd64 (20170801)]/ xenial main restricted

# deb cdrom:[Ubuntu-Server 16.04.3 LTS _Xenial Xerus_ - Release amd64 (20170801)]/ xenial main restricted
[...]

Update Ubuntu and install some required packages:

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install build-essential ssh lzop git rsync curl
$ sudo apt-get install python-dev python-setuptools
$ sudo apt-get install libcurl4-openssl-dev
$ sudo easy_install pip
$ sudo pip install virtualenv virtualenvwrapper python-dateutil

In my case the hadoop user was already created during the Ubuntu installation; if you have not created it yet, do so as follows:

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop

Configuring SSH

$ sudo su - hadoop
$ ssh-keygen
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys
$ ssh -l hadoop localhost
$ exit
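The `ssh-keygen` step above is interactive. If you are scripting the setup, the same key exchange can be sketched non-interactively; the demo path and the empty passphrase are assumptions (for real use, the files live under ~/.ssh):

```shell
# Sketch: the interactive steps above, done non-interactively.
# Demo path and empty passphrase are assumptions; use ~/.ssh in practice.
keydir=/tmp/hadoop-ssh-demo
rm -rf "$keydir" && mkdir -p "$keydir" && chmod 700 "$keydir"
ssh-keygen -q -t rsa -N '' -f "$keydir/id_rsa"        # empty passphrase
cat "$keydir/id_rsa.pub" >> "$keydir/authorized_keys"
chmod 600 "$keydir/authorized_keys"                   # sshd refuses looser modes
```

The final `ssh -l hadoop localhost` test matters: start-dfs.sh uses SSH to launch the daemons, so a passwordless login to localhost must work before proceeding.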

Installing Java 8

$ sudo apt-get install openjdk-8-jdk
$ sudo apt-get install openjdk-8-dbg
$ java -version
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK Zero VM (build 25.151-b12, interpreted mode)
$

Add the lines between [...] and [...] to the end of the file. Reboot the machine after changing the file (or run `sudo sysctl -p` to apply the settings without rebooting).

$ sudo vi /etc/sysctl.conf
[...]
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
[...]

Installing Hadoop 3.0.0

$ curl -O http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz

Extracting the files

$ tar -xzf hadoop-3.0.0.tar.gz
$ sudo mv hadoop-3.0.0 /srv/
$ sudo chown -R hadoop:hadoop /srv/hadoop-3.0.0
$ sudo chmod g+w -R /srv/hadoop-3.0.0
$ sudo ln -s /srv/hadoop-3.0.0 /srv/hadoop

Configure the environment variables in ~hadoop/.bashrc. Add the lines between [...] and [...] to the end of the file:

$ sudo su - hadoop
$ vi ~hadoop/.bashrc
[...]
# Hadoop environment variables
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

# Define JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
[...]
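The guide later invokes start-dfs.sh and start-yarn.sh by their full path under $HADOOP_HOME/sbin. Optionally, you can also put sbin on the PATH so the admin scripts can be called directly; a sketch using the same paths as above:

```shell
# Optional: expose Hadoop's admin scripts (start-dfs.sh, stop-yarn.sh, ...)
# in addition to the bin/ client tools.
export HADOOP_HOME=/srv/hadoop
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```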

Check that the environment is configured correctly:

hadoop@hadoop02:~$ source ~hadoop/.bashrc
hadoop@hadoop02:~$ hadoop version
Hadoop 3.0.0
Source code repository https://git-wip-us.apache.org/repos/asf/hadoop.git -r c25427ceca461ee979d30edd7a4b0f50718e6533
Compiled by andrew on 2017-12-08T19:16Z
Compiled with protoc 2.5.0
From source with checksum 397832cb5529187dc8cd74ad54ff22
This command was run using /srv/hadoop-3.0.0/share/hadoop/common/hadoop-common-3.0.0.jar
hadoop@hadoop02:~$

Configuring the services

Edit hadoop-env.sh and change it as follows:

$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Change from
[...]
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
# export JAVA_HOME=

# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
[...]
to
[...]
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/srv/hadoop
[...]

Configuring core-site.xml

$ cat > $HADOOP_HOME/etc/hadoop/core-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000/</value>
   </property>
   <property>
      <name>hadoop.tmp.dir</name>
      <value>/var/app/hadoop/data</value>
   </property>
</configuration>
EOF
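Note that `fs.default.name` is the deprecated (Hadoop 1.x) name of this property; it still works, but the current key is `fs.defaultFS`. An equivalent, non-deprecated property block would be:

```xml
<property>
   <name>fs.defaultFS</name>
   <value>hdfs://localhost:9000/</value>
</property>
```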

Configuring mapred-site.xml

$ cat > $HADOOP_HOME/etc/hadoop/mapred-site.xml << "EOF"
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
   </property>
   <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
   </property>
   <property>
       <name>mapreduce.reduce.env</name>
       <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
   </property>
</configuration>
EOF

Configuring hdfs-site.xml

$ cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml << "EOF"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
       <name>dfs.replication</name>
       <value>1</value>
    </property>
</configuration>
EOF
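With only `dfs.replication` set, the NameNode and DataNode storage directories default to paths under `hadoop.tmp.dir` (here `/var/app/hadoop/data`, as the format output below confirms). If you prefer to pin them explicitly, the standard properties are `dfs.namenode.name.dir` and `dfs.datanode.data.dir`; the paths below just spell out the defaults for this setup:

```xml
<property>
   <name>dfs.namenode.name.dir</name>
   <value>/var/app/hadoop/data/dfs/name</value>
</property>
<property>
   <name>dfs.datanode.data.dir</name>
   <value>/var/app/hadoop/data/dfs/data</value>
</property>
```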

Configuring yarn-site.xml

$ cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml << "EOF"
<?xml version="1.0"?>
<!--
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property> 
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value> 
   </property> 
   <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>localhost:8025</value>
   </property>
   <property>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>localhost:8030</value>
   </property>
   <property>
      <name>yarn.resourcemanager.address</name>
      <value>localhost:8050</value>
   </property>
</configuration>
EOF

Formatting the NameNode

Note: `hadoop namenode -format` still works but is deprecated in Hadoop 3; `hdfs namenode -format` is the current form.

$ sudo mkdir -p /var/app/hadoop/data
$ sudo chown hadoop:hadoop -R /var/app/hadoop
$ $HADOOP_HOME/bin/hadoop namenode -format
[...]
, /var/app/hadoop/data/dfs/name/current/fsimage_0000000000000000000, /var/app/hadoop/data/dfs/name/current/fsimage_0000000000000000000.md5]
2018-01-16 18:33:58,978 INFO common.Storage: Storage directory /var/app/hadoop/data/dfs/name has been successfully formatted.
2018-01-16 18:33:58,992 INFO namenode.FSImageFormatProtobuf: Saving image file /var/app/hadoop/data/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2018-01-16 18:33:59,084 INFO namenode.FSImageFormatProtobuf: Image file /var/app/hadoop/data/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 391 bytes saved in 0 seconds.
2018-01-16 18:33:59,093 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2018-01-16 18:33:59,098 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop02/192.168.211.129
************************************************************/
hadoop@hadoop02:~$

Starting Hadoop

hadoop@hadoop02:~$ $HADOOP_HOME/sbin/start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [hadoop02]
hadoop02: Warning: Permanently added 'hadoop02,192.168.211.129' (ECDSA) to the list of known hosts.
hadoop@hadoop02:~$

Validating the daemons

hadoop@hadoop02:~$ jps
1360 NameNode
1829 Jps
1480 DataNode
1678 SecondaryNameNode
hadoop@hadoop02:~$

 

Starting YARN

hadoop@hadoop02:~$ $HADOOP_HOME/sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
hadoop@hadoop02:~$

 

Below is the list of running processes:

hadoop@hadoop02:~$ jps
1360 NameNode
2400 Jps
2229 NodeManager
1943 ResourceManager
1480 DataNode
1678 SecondaryNameNode
hadoop@hadoop02:~$

Creating a test directory in HDFS and uploading a file:

hadoop@hadoop02:~$ hadoop fs -mkdir /analysis
hadoop@hadoop02:~$ hadoop fs -ls /
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-01-16 18:41 /analysis
hadoop@hadoop02:~$

Creating a file and uploading it:

hadoop@hadoop02:~$ echo "Testando, arquivo, hdfs" > ./teste.txt
hadoop@hadoop02:~$ hadoop fs -put ./teste.txt /analysis/teste.txt
hadoop@hadoop02:~$ hadoop fs -ls /analysis
Found 1 items
-rw-r--r-- 1 hadoop supergroup 24 2018-01-16 18:42 /analysis/teste.txt
hadoop@hadoop02:~$ hadoop fs -tail /analysis/teste.txt
Testando, arquivo, hdfs
hadoop@hadoop02:~$

Now we can check the cluster status through a web browser, using the public hostname or IP of the NameNode, ResourceManager, and JobHistory Server. In our case the NameNode IP is 192.168.211.129; in Hadoop 3 the default web UI ports are 9870 (NameNode), 8088 (ResourceManager), and 19888 (JobHistory Server).

http://hadoop02:8088

 

Now let's run a sample MapReduce job on our cluster:

hadoop@hadoop02:~$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar pi 2 4
Number of Maps = 2
Samples per Map = 4
Wrote input for Map #0
Wrote input for Map #1
Starting Job
2018-01-16 19:03:22,810 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8050
2018-01-16 19:03:23,120 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1516136552303_0001
2018-01-16 19:03:23,250 INFO input.FileInputFormat: Total input files to process : 2
2018-01-16 19:03:23,314 INFO mapreduce.JobSubmitter: number of splits:2
2018-01-16 19:03:23,357 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
2018-01-16 19:03:24,021 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516136552303_0001
2018-01-16 19:03:24,030 INFO mapreduce.JobSubmitter: Executing with tokens: []
2018-01-16 19:03:24,293 INFO conf.Configuration: resource-types.xml not found
2018-01-16 19:03:24,293 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-01-16 19:03:24,573 INFO impl.YarnClientImpl: Submitted application application_1516136552303_0001
2018-01-16 19:03:24,638 INFO mapreduce.Job: The url to track the job: http://hadoop02:8088/proxy/application_1516136552303_0001/
2018-01-16 19:03:24,638 INFO mapreduce.Job: Running job: job_1516136552303_0001
2018-01-16 19:03:35,676 INFO mapreduce.Job: Job job_1516136552303_0001 running in uber mode : false
2018-01-16 19:03:35,681 INFO mapreduce.Job: map 0% reduce 0%
2018-01-16 19:03:44,894 INFO mapreduce.Job: Task Id : attempt_1516136552303_0001_m_000001_0, Status : FAILED
[2018-01-16 19:03:43.424]Container [pid=2363,containerID=container_1516136552303_0001_01_000003] is running beyond virtual memory limits. Current usage: 148.6 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1516136552303_0001_01_000003 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 2363 2358 2363 2363 (bash) 0 0 12840960 754 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/var/app/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1516136552303_0001/container_1516136552303_0001_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.211.129 42100 attempt_1516136552303_0001_m_000001_0 3 1>/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000003/stdout 2>/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000003/stderr
 |- 2373 2363 2363 2363 (java) 246 436 2616496128 37296 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/var/app/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1516136552303_0001/container_1516136552303_0001_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.211.129 42100 attempt_1516136552303_0001_m_000001_0 3

[2018-01-16 19:03:43.533]Container killed on request. Exit code is 143
[2018-01-16 19:03:43.636]Container exited with a non-zero exit code 143.

2018-01-16 19:03:44,919 INFO mapreduce.Job: Task Id : attempt_1516136552303_0001_m_000000_0, Status : FAILED
[2018-01-16 19:03:43.638]Container [pid=2348,containerID=container_1516136552303_0001_01_000002] is running beyond virtual memory limits. Current usage: 157.2 MB of 1 GB physical memory used; 2.5 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1516136552303_0001_01_000002 :
 |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
 |- 2348 2347 2348 2348 (bash) 0 1 12840960 759 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/var/app/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1516136552303_0001/container_1516136552303_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.211.129 42100 attempt_1516136552303_0001_m_000000_0 2 1>/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000002/stdout 2>/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000002/stderr
 |- 2360 2348 2348 2348 (java) 259 447 2623782912 39476 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/var/app/hadoop/data/nm-local-dir/usercache/hadoop/appcache/application_1516136552303_0001/container_1516136552303_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/srv/hadoop/logs/userlogs/application_1516136552303_0001/container_1516136552303_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.211.129 42100 attempt_1516136552303_0001_m_000000_0 2

[2018-01-16 19:03:43.696]Container killed on request. Exit code is 143
[2018-01-16 19:03:43.711]Container exited with a non-zero exit code 143.

2018-01-16 19:03:54,022 INFO mapreduce.Job: map 50% reduce 0%
2018-01-16 19:03:55,029 INFO mapreduce.Job: map 100% reduce 0%
2018-01-16 19:03:59,082 INFO mapreduce.Job: map 100% reduce 100%
2018-01-16 19:04:00,125 INFO mapreduce.Job: Job job_1516136552303_0001 completed successfully
2018-01-16 19:04:00,284 INFO mapreduce.Job: Counters: 55
 File System Counters
 FILE: Number of bytes read=50
 FILE: Number of bytes written=617241
 FILE: Number of read operations=0
 FILE: Number of large read operations=0
 FILE: Number of write operations=0
 HDFS: Number of bytes read=530
 HDFS: Number of bytes written=215
 HDFS: Number of read operations=13
 HDFS: Number of large read operations=0
 HDFS: Number of write operations=3
 Job Counters
 Failed map tasks=2
 Launched map tasks=4
 Launched reduce tasks=1
 Other local map tasks=2
 Data-local map tasks=2
 Total time spent by all maps in occupied slots (ms)=28983
 Total time spent by all reduces in occupied slots (ms)=2755
 Total time spent by all map tasks (ms)=28983
 Total time spent by all reduce tasks (ms)=2755
 Total vcore-milliseconds taken by all map tasks=28983
 Total vcore-milliseconds taken by all reduce tasks=2755
 Total megabyte-milliseconds taken by all map tasks=29678592
 Total megabyte-milliseconds taken by all reduce tasks=2821120
 Map-Reduce Framework
 Map input records=2
 Map output records=4
 Map output bytes=36
 Map output materialized bytes=56
 Input split bytes=294
 Combine input records=0
 Combine output records=0
 Reduce input groups=2
 Reduce shuffle bytes=56
 Reduce input records=4
 Reduce output records=0
 Spilled Records=8
 Shuffled Maps =2
 Failed Shuffles=0
 Merged Map outputs=2
 GC time elapsed (ms)=1708
 CPU time spent (ms)=3910
 Physical memory (bytes) snapshot=861880320
 Virtual memory (bytes) snapshot=7954747392
 Total committed heap usage (bytes)=632815616
 Peak Map Physical memory (bytes)=329654272
 Peak Map Virtual memory (bytes)=2652475392
 Peak Reduce Physical memory (bytes)=226496512
 Peak Reduce Virtual memory (bytes)=2663587840
 Shuffle Errors
 BAD_ID=0
 CONNECTION=0
 IO_ERROR=0
 WRONG_LENGTH=0
 WRONG_MAP=0
 WRONG_REDUCE=0
 File Input Format Counters
 Bytes Read=236
 File Output Format Counters
 Bytes Written=97
Job Finished in 37.578 seconds
Estimated value of Pi is 3.50000000000000000000
hadoop@hadoop02:~$
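Note the two FAILED map attempts in the output above: the containers were killed for exceeding the virtual memory limit (about 2.4 GB used against a 2.1 GB cap, i.e. 1 GB of physical memory times the default `yarn.nodemanager.vmem-pmem-ratio` of 2.1). The job still finished because the retried attempts fit within the limit. On small VMs it is common to either raise the ratio or disable the virtual memory check altogether in yarn-site.xml; a sketch (the ratio value of 4 is an arbitrary example):

```xml
<!-- Either relax the physical-to-virtual memory ratio... -->
<property>
   <name>yarn.nodemanager.vmem-pmem-ratio</name>
   <value>4</value>
</property>
<!-- ...or disable the virtual-memory check entirely -->
<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
</property>
```

Restart YARN after changing the file for the new limits to take effect.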

Stopping the Hadoop cluster

hadoop@hadoop02:~$ $HADOOP_HOME/sbin/stop-yarn.sh
hadoop@hadoop02:~$ $HADOOP_HOME/sbin/stop-dfs.sh

 

Douglas Ribas de Mattos
E-mail: douglasmattos0@gmail.com
Github: https://github.com/douglasmattos0
LinkedIn: https://www.linkedin.com/in/douglasmattos0/
