Spark and Hadoop on an RPi3 cluster

Step 1

Yesterday (12 September) we started building the Spark and Hadoop cluster; the photo below shows the cluster's physical topology so far. Dron took care of the racking. We will finish the configuration next week; a quick reachability check for the nodes is sketched right after the list.

  1. RPi3 no. 1: hadoop-rpi1 - 192.168.1.14X
  2. RPi3 no. 2: hadoop-rpi2 - 192.168.1.14X
  3. RPi3 no. 3: hadoop-rpi3 - 192.168.1.144
  4. RPi3 no. 4: hadoop-rpi4 - 192.168.1.14X
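
Once the nodes are on the network, reachability over this list can be checked with a small loop along these lines (a minimal sketch; it assumes the hostnames resolve, e.g. via the /etc/hosts entries configured further down):

for h in hadoop-rpi1 hadoop-rpi2 hadoop-rpi3 hadoop-rpi4; do
  ping -c1 -W2 "$h" >/dev/null && echo "$h: ping OK" || echo "$h: unreachable"
done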

Step 2

On 18 September, PB and I finished installing and configuring RPi3 no. 4 up to the HDFS single-node state. PB started re-imaging the remaining SD cards for RPi3 nos. 1-3.
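
For the record, re-imaging an SD card boils down to writing a Raspbian Lite image to the card; a sketch only (the image file name and /dev/mmcblk0 are placeholders -- double-check the target device before running dd):

# Write a Raspbian Lite image to an SD card (run on a workstation, as root)
unzip -p raspbian-stretch-lite.zip | dd of=/dev/mmcblk0 bs=4M status=progress conv=fsync
sync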

Initial RPi3 configuration with the Hadoop and Spark installation

# RPi3 -- 192.168.1.144
## Network configuration ##
# Set the hostname
HOSTNAME=hadoop-rpi3
hostnamectl set-hostname $HOSTNAME

# Configure /etc/hosts
cat <<EOT>/etc/hosts
127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters

127.0.1.1       hadoop-rpi3

192.168.1.14X   hadoop-rpi1
192.168.1.14X   hadoop-rpi2
192.168.1.144   hadoop-rpi3
192.168.1.14X   hadoop-rpi4
EOT

# Set the Wi-Fi SSID and password
cp /etc/wpa_supplicant/wpa_supplicant.conf{,.bak}
rm -f /etc/wpa_supplicant/wpa_supplicant.conf
cat <<EOT>/etc/wpa_supplicant/wpa_supplicant.conf
country=US
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
ssid="LabkaT"
psk="naseskvelehoeslocobychuhad"
}
EOT

# Configure /etc/network/interfaces
cp /etc/network/interfaces{,.bak}
rm -f /etc/network/interfaces
cat <<EOT>/etc/network/interfaces
source-directory /etc/network/interfaces.d
source /etc/network/interfaces.d/*
EOT

# Configure eth0
cat <<EOT >/etc/network/interfaces.d/eth0
auto eth0
iface eth0 inet dhcp
EOT
ifup eth0

# Configure wlan0
cat <<EOT >/etc/network/interfaces.d/wlan0
auto wlan0
iface wlan0 inet dhcp
EOT
ifup wlan0

## Update and install tools
apt-get update
apt-get upgrade
apt-get install zip unzip ntp lsof sysstat wget

## NTP configuration ##
timedatectl set-timezone 'Europe/Prague'
cp /etc/ntp.conf{,.bak}
rm -f /etc/ntp.conf
cat <<EOT>/etc/ntp.conf
driftfile /var/lib/ntp/ntp.drift
statsdir /var/log/ntpstats/
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
server ntp.nic.cz iburst prefer
server tik.cesnet.cz iburst
server tak.cesnet.cz iburst
pool 0.debian.pool.ntp.org iburst
pool 1.debian.pool.ntp.org iburst
pool 2.debian.pool.ntp.org iburst
pool 3.debian.pool.ntp.org iburst
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict 127.0.0.1
restrict ::1
restrict source notrap nomodify noquery
EOT

/lib/systemd/systemd-sysv-install enable ntp
systemctl stop ntp.service
systemctl start ntp.service
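
# Sanity check: NTP peers should show non-zero "reach" after a few minutes
ntpq -p
timedatectl status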

## Install and configure Hadoop and Spark
# Add the hadoop user
addgroup hadoop
adduser --ingroup hadoop hduser
adduser hduser sudo

# Install OpenJDK 8
apt-get install openjdk-8-jdk --fix-missing   # the JDK (not just the JRE) provides jps, used later
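
# Sanity check: verify the runtime and the JVM path that JAVA_HOME points to below
java -version
readlink -f "$(which java)"   # expected to end in .../java-8-openjdk-armhf/jre/bin/java or similar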

# Install Spark
mkdir -p /opt
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
tar -xvzf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/
mkdir /opt/spark-2.2.0-bin-hadoop2.7/logs
chown -R hduser /opt/spark-2.2.0-bin-hadoop2.7

# Install Hadoop
wget http://apache.osuosl.org/hadoop/common/hadoop-2.7.4/hadoop-2.7.4.tar.gz
tar -xvzf hadoop-2.7.4.tar.gz -C /opt/
mkdir /opt/hadoop-2.7.4/logs/
chown -R hduser:hadoop /opt/hadoop-2.7.4/

# Configure /opt/hadoop-2.7.4/etc/hadoop/masters
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/masters
192.168.1.144
EOT

# Configure /opt/hadoop-2.7.4/etc/hadoop/slaves
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/slaves
192.168.1.14X
192.168.1.144
192.168.1.14X
192.168.1.14X
EOT

# Configure /opt/hadoop-2.7.4/etc/hadoop/mapred-site.xml
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/mapred-site.xml
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>256</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx204m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>102</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx102m</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>128</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx102m</value>
</property>
</configuration>
EOT
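
# The heap options above track the usual rule of thumb of ~0.8 x container size
# (256 MB map container -> -Xmx204m, 128 MB AM container -> -Xmx102m); the reduce
# container (102 MB with -Xmx102m) leaves no headroom and may be worth revisiting.
echo $((256 * 8 / 10))   # 204
echo $((128 * 8 / 10))   # 102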

# Configure /opt/hadoop-2.7.4/etc/hadoop/hdfs-site.xml
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/hdfs-site.xml
<configuration>
<property>
        <name>dfs.replication</name>
        <value>1</value>
</property>
</configuration>
EOT

# Configure /opt/hadoop-2.7.4/etc/hadoop/core-site.xml
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
          <name>hadoop.tmp.dir</name>
          <value>/hdfs/tmp</value>
  </property>
</configuration>
EOT

# Configure /opt/hadoop-2.7.4/etc/hadoop/yarn-site.xml
cat <<EOT>/opt/hadoop-2.7.4/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
  <property>
          <name>yarn.resourcemanager.resource-tracker.address</name>
          <value>192.168.1.144:8025</value>
  </property>
  <property>
          <name>yarn.resourcemanager.scheduler.address</name>
          <value>192.168.1.144:8030</value>
  </property>
  <property>
          <name>yarn.resourcemanager.address</name>
          <value>192.168.1.144:8050</value>
  </property>
  <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
  </property>
  <property>
          <name>yarn.nodemanager.resource.cpu-vcores</name>
          <value>4</value>
  </property>
  <property>
          <name>yarn.nodemanager.resource.memory-mb</name>
          <value>1024</value>
  </property>
  <property>
          <name>yarn.scheduler.minimum-allocation-mb</name>
          <value>128</value>
  </property>
  <property>
          <name>yarn.scheduler.maximum-allocation-mb</name>
          <value>1024</value>
  </property>
  <property>
          <name>yarn.scheduler.minimum-allocation-vcores</name>
          <value>1</value>
  </property>
  <property>
          <name>yarn.scheduler.maximum-allocation-vcores</name>
          <value>4</value>
  </property>
  <property>
          <name>yarn.nodemanager.vmem-check-enabled</name>
          <value>false</value>
  </property>
  <property>
          <name>yarn.nodemanager.pmem-check-enabled</name>
          <value>true</value>
  </property>
  <property>
          <name>yarn.nodemanager.vmem-pmem-ratio</name>
          <value>4</value>
  </property>
  <property>
          <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
          <value>98.5</value>
  </property>
</configuration>
EOT

# Configure per $SPARK_HOME/conf/spark-env.sh.template
cat <<EOT>/opt/spark-2.2.0-bin-hadoop2.7/conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=192.168.1.144
SPARK_WORKER_MEMORY=512m
EOT
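
# Optional standalone smoke test (run as hduser; paths per the Spark 2.2.0 install
# above). The master web UI then answers on http://192.168.1.144:8080/.
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-master.sh
/opt/spark-2.2.0-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.1.144:7077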

# Configure /opt/hadoop-2.7.4/etc/hadoop/hadoop-env.sh (quoted heredoc delimiter
# so the variable references below end up literally in the file)
cat <<'EOT' >/opt/hadoop-2.7.4/etc/hadoop/hadoop-env.sh
#!/usr/bin/env bash
# Set Hadoop-specific environment variables here.
#export JAVA_HOME=${JAVA_HOME}
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/jre
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/opt/hadoop-2.7.4/etc/hadoop"}

# Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options.  Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"

export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"

export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol.  This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER}

# Where log files are stored.  $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

# HDFS Mover specific parameters
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
# export HADOOP_MOVER_OPTS=""
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

export HADOOP_IDENT_STRING=$USER
EOT

# Append to /home/hduser/.bashrc (quoted delimiter keeps $HADOOP_HOME and $PATH
# unexpanded so they resolve when hduser's shell sources the file)
cat <<'EOT' >>/home/hduser/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf/jre
export HADOOP_HOME=/opt/hadoop-2.7.4
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export SPARK_HOME=/opt/spark-2.2.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
EOT
source /home/hduser/.bashrc
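
# Sanity check after sourcing the environment: both should print version info
hadoop version
spark-submit --version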

# Fix ownership
chown -R hduser:hadoop /home/hduser/
chown -R hduser:hadoop /opt/hadoop-2.7.4/
chown -R hduser /opt/spark-2.2.0-bin-hadoop2.7

# Configure /hdfs/tmp
mkdir -p /hdfs/tmp
chown hduser:hadoop /hdfs/tmp
chmod 750 /hdfs/tmp
hdfs namenode -format

TODO: LDAP integration
http://www.openldap.org/lists/openldap-technical/201507/msg00100.html
https://hortonworks.com/blog/hadoop-groupmapping-ldap-integration/
https://gist.github.com/laurentedel/60c8d02254d7439a7ef7

Testing Hadoop and Spark

http://IP_OF_YOUR_SERVER:8088/
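
Port 8088 is the YARN ResourceManager UI; assuming stock ports, the other useful web UIs on the master are:

http://IP_OF_YOUR_SERVER:50070/   # HDFS NameNode (Hadoop 2.x)
http://IP_OF_YOUR_SERVER:8080/    # Spark standalone master, if started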

## Hadoop and Spark
# Start up
su hduser
source /home/hduser/.bashrc
hdfs namenode -format
/opt/hadoop-2.7.4/sbin/start-dfs.sh
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/hduser
/opt/hadoop-2.7.4/sbin/start-yarn.sh

# Check the running daemons
jps
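
# With both start scripts run on the single node, jps should list roughly:
# NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager (plus Jps).
# If a daemon is missing, its log is the place to look:
ls -lt /opt/hadoop-2.7.4/logs/ | head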

# Test the Hadoop install with the bundled grep example
hdfs dfs -put /opt/hadoop-2.7.4/etc/hadoop input
hadoop jar /opt/hadoop-2.7.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar grep input output 'dfs[a-z.]+'
hdfs dfs -get output output
cat output/*

# Run a Spark job
spark-submit --class com.learning.spark.SparkWordCount --master yarn --executor-memory 512m ~/word_count-0.0.1-SNAPSHOT.jar /ntallapa/word_count/text 2
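
# If the word_count jar is not at hand, the SparkPi example bundled with the
# Spark 2.2.0 distribution works as a YARN smoke test (jar name per that
# distribution's examples/jars directory):
spark-submit --class org.apache.spark.examples.SparkPi --master yarn --executor-memory 512m /opt/spark-2.2.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.2.0.jar 10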

# Run an example MapReduce job
hadoop jar /opt/hadoop-2.7.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar wordcount /ntallapa/word_count/text /ntallapa/word_count/output

# Clean up
rm -r /hdfs/tmp/dfs/data/current

Step 3

Attach the Cisco 2960 switch to the internal VLAN. Embargo promised to reserve the IP addresses in DHCP.

Step 4

Finish imaging RPi3 nos. 1-3 and add them to the Hadoop cluster.
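
A rough outline for bringing a re-imaged node into the cluster (a sketch only; it assumes the node went through the same base configuration as hadoop-rpi3 above and that fs.defaultFS in core-site.xml gets switched from localhost to the master, e.g. hdfs://hadoop-rpi3:9000):

# On the master: add the new node to /opt/hadoop-2.7.4/etc/hadoop/slaves
# On the new node (as hduser): pull the config from the master and start the workers
scp -r hduser@hadoop-rpi3:/opt/hadoop-2.7.4/etc/hadoop/* /opt/hadoop-2.7.4/etc/hadoop/
/opt/hadoop-2.7.4/sbin/hadoop-daemon.sh start datanode
/opt/hadoop-2.7.4/sbin/yarn-daemon.sh start nodemanager
# Back on the master: the node should show up here
hdfs dfsadmin -report
yarn node -list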

Step 5

Install Apache Hive.

https://www.tutorialspoint.com/hive/hive_installation.htm
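
In outline, a tarball install on the master might look like this (a sketch; the Hive version and the embedded Derby metastore are assumptions, nothing has been decided yet):

wget https://archive.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz
tar -xvzf apache-hive-2.1.1-bin.tar.gz -C /opt/
chown -R hduser:hadoop /opt/apache-hive-2.1.1-bin
# As hduser: environment and metastore init, then the CLI
export HIVE_HOME=/opt/apache-hive-2.1.1-bin
export PATH=$PATH:$HIVE_HOME/bin
schematool -dbType derby -initSchema
hive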

Step 6

ZooKeeper

http://bigdataschool.blogspot.cz/2016/02/how-to-setup-3-node-apache-zookeeper.html
http://www.thegeekstuff.com/2016/10/zookeeper-cluster-install/

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_zookeeper_package_install.html

# CDH one-click install from the Cloudera docs (targets RHEL x86_64; kept for reference)
yum install https://archive.cloudera.com/cdh5/one-click-install/redhat/7/x86_64/cloudera-cdh-5-0.x86_64.rpm
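
Since the CDH package targets RHEL on x86_64, a tarball install is the more likely route on Raspbian; a sketch for a three-node ensemble (the ZooKeeper version and the choice of nodes 1-3 are assumptions):

wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.10/zookeeper-3.4.10.tar.gz
tar -xvzf zookeeper-3.4.10.tar.gz -C /opt/
mkdir -p /var/lib/zookeeper
cat <<EOT >/opt/zookeeper-3.4.10/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=hadoop-rpi1:2888:3888
server.2=hadoop-rpi2:2888:3888
server.3=hadoop-rpi3:2888:3888
EOT
echo 1 > /var/lib/zookeeper/myid   # 1, 2 or 3, per node
/opt/zookeeper-3.4.10/bin/zkServer.sh start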

Step 7

Apache Flume

https://dpaynedudhe.wordpress.com/2015/06/16/installing-flume-on-ubuntu/
http://hadooptutorial.info/apache-flume-installation/
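
The links above cover the install itself; as a first smoke test, the minimal netcat-to-logger agent from the Flume user guide looks roughly like this ($FLUME_HOME below is a placeholder for wherever Flume ends up):

cat <<EOT >$FLUME_HOME/conf/example.conf
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
EOT
$FLUME_HOME/bin/flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/example.conf --name a1 -Dflume.root.logger=INFO,console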

Step 8

Apache Oozie

https://www.tecmint.com/install-apache-oozie-for-cdh-in-centos/
https://oozie.apache.org/docs/4.2.0/AG_Install.html
http://www.rohitmenon.com/index.php/apache-oozie-installation/

Step 9

Install the Hadoop PCAP library.

Step 10

Tests:

  • Network traffic analysis with the PCAP library

https://github.com/RIPE-NCC/hadoop-pcap

  • Twitter feed analysis with Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive

http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/