Hadoop: Running a MapReduce Job: Difference between revisions

Latest revision as of 01:37, 10 November 2015

Source: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php

Start Hadoop

Before running a MapReduce job, make sure Hadoop is running:

start-all.sh

MapReduce Preparation

Before we jump into MapReduce programming, it is worth going over the preparation steps that are commonly taken. Because MapReduce usually operates on very large data sets, these steps need to be considered before actually running a job.

The underlying structure of the HDFS filesystem is very different from ordinary file systems. Block sizes are considerably larger, and the actual block size for a cluster depends on the cluster configuration: typically 64, 128, or 256 MB. So we may need to partition the data into blocks in a customized way.

Image source: Hadoop MapReduce Fundamentals.

Another consideration is where the data will come from for the MapReduce operations, or the parallel processing, that we run on it. Although we will work with the core Hadoop filesystem, MapReduce algorithms can also run against data stored in other locations, such as the native filesystem or cloud storage such as Amazon S3 buckets or Windows Azure blobs.

Another consideration is that the output of a MapReduce job is immutable. The output is a one-time output: when new output is generated, it is written under a new file name.

The last consideration in preparing for MapReduce is the logic we will write, which should fit the problem we are trying to solve. We write that logic in some programming language, library, or tool to map the data and then reduce it, which produces the output.

Note also that we work with key-value pairs: regardless of the format of the incoming data, we want to output key-value pairs.
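
The map-then-reduce flow over key-value pairs can be illustrated with a minimal word-count sketch in plain Python. It does not use the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit one (word, 1) key-value pair per word in the input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single (key, total) pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hdfs test", "test"]))
```

Whatever the input format, the mapper's job here is to turn each record into key-value pairs, and the reducer only ever sees one key together with all of its values.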

Hadoop Shell Commands

Before running a MapReduce job, we should be familiar with some of the Hadoop shell commands. It is worth reading:

Hadoop: Perintah Shell

Running a MapReduce Job

Change to the Hadoop installation directory and list its contents:

cd /usr/local/hadoop
ls

bin  etc  include  lib  libexec  LICENSE.txt  logs  NOTICE.txt  README.txt  sbin  share

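Start Hadoop, then run the pi example from the bundled examples jar. The two arguments are the number of maps and the number of samples per map: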

start-all.sh 
cd /usr/local/hadoop
hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 2 5

Number of Maps  = 2
Samples per Map = 5
15/11/09 16:28:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wrote input for Map #0
Wrote input for Map #1
Starting Job
15/11/09 16:28:59 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/11/09 16:28:59 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/11/09 16:29:00 INFO input.FileInputFormat: Total input paths to process : 2
15/11/09 16:29:00 INFO mapreduce.JobSubmitter: number of splits:2
..
..
..
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=236
	File Output Format Counters 
		Bytes Written=97
Job Finished in 3.729 seconds
Estimated value of Pi is 3.60000000000000000000

For fun, try increasing the number of maps and the samples per map, for example:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 20 1000

The result is more accurate:

Job Finished in 5.541 seconds
Estimated value of Pi is 3.14280000000000000000

Or more accurate still:

hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 200 1000

The result:

Job Finished in 35.865 seconds
Estimated value of Pi is 3.14118000000000000000
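
Why more samples give a better estimate can be seen in a small single-process sketch. To my understanding the bundled pi example uses a quasi-Monte Carlo method based on a Halton sequence; the code below is an independent illustration of that idea, not Hadoop's implementation, and the names halton and estimate_pi are assumptions.

```python
def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence in the given base."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def estimate_pi(num_maps, samples_per_map):
    """Estimate Pi from num_maps * samples_per_map points in the unit square.

    Each point is shifted so the square is centred on the origin; the share
    of points inside the inscribed circle (radius 0.5) approximates Pi / 4.
    """
    total = num_maps * samples_per_map
    inside = 0
    for i in range(1, total + 1):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return 4.0 * inside / total


if __name__ == "__main__":
    for maps, samples in [(2, 5), (20, 1000)]:
        print(maps, samples, estimate_pi(maps, samples))
```

With 2 maps and 5 samples per map there are only 10 points, so the estimate can only be a multiple of 0.4. That is why such a small run yields a coarse value, and why raising either argument tightens the result.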

Hadoop FileSystem (HDFS)

Files are stored in the Hadoop Distributed File System (HDFS). Suppose we store a file called data.txt in HDFS.

The file is 160 megabytes. When a file is loaded into HDFS, it is split into chunks called blocks. This example assumes a block size of 64 megabytes, the default in Hadoop 1.x (Hadoop 2.x, including the 2.7.1 release used above, defaults to 128 megabytes). Each block is given a unique name: blk, an underscore, and a large number. In our case the first block is 64 megabytes, the second block is 64 megabytes, and the third block holds the remaining 32 megabytes, which together make up the 160-megabyte file.
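
The block arithmetic can be checked with a short sketch. The helper name split_into_blocks is an assumption for illustration, and sizes are whole megabytes to keep the example simple.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("file size must be >= 0 and block size must be > 0")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left over; it is not padded.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    print(split_into_blocks(160))        # [64, 64, 32]
    print(split_into_blocks(160, 128))   # [128, 32]
```

The second call shows the same file under a 128 MB block size, where it occupies two blocks instead of three.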

As the file is uploaded to HDFS, each block is stored on one of the nodes in the cluster. A daemon called the DataNode runs on each of these machines. We also need to know which blocks make up the original file; that is handled by a separate machine running a daemon called the NameNode. The information stored on the NameNode is known as metadata.
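
As a toy model of that division of labour, the sketch below builds the kind of mapping a NameNode keeps: for each file, the ordered list of its blocks and the DataNode holding each one. The round-robin placement, the block IDs, and the node names are all assumptions for illustration; real HDFS uses its own placement policy and also replicates each block.

```python
from itertools import cycle


def place_blocks(filename, block_ids, datanodes):
    """Return toy metadata: filename -> ordered list of (block id, DataNode)."""
    if not datanodes:
        raise ValueError("at least one DataNode is required")
    nodes = cycle(datanodes)  # hypothetical round-robin placement
    return {filename: [(block_id, next(nodes)) for block_id in block_ids]}


if __name__ == "__main__":
    metadata = place_blocks(
        "data.txt",
        ["blk_1", "blk_2", "blk_3"],   # hypothetical block IDs
        ["node1", "node2"],            # hypothetical DataNode names
    )
    for block_id, node in metadata["data.txt"]:
        print(block_id, "->", node)
```

Reading the file back means looking up this list and fetching the blocks in order from the nodes that hold them.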

HDFS Commands

While Hadoop is running, create hdfsTest.txt in the local home directory:

echo "hdfs test" > hdfsTest.txt

Then create the home directory in HDFS:

hadoop fs -mkdir -p /user/hduser

We can copy hdfsTest.txt from the local disk to the user's directory in HDFS:

hadoop fs -copyFromLocal hdfsTest.txt hdfsTest.txt

We could also have used put instead of copyFromLocal:

hadoop fs -put hdfsTest.txt

List the contents of the user's home directory in HDFS:

hadoop fs -ls

Found 1 items
-rw-r--r--   1 hduser supergroup          5 2014-07-14 01:49 hdfsTest.txt

To display the contents of the HDFS file /user/hduser/hdfsTest.txt:

hadoop fs -cat /user/hduser/hdfsTest.txt

We can also copy the file from HDFS to the local disk, naming it hdfsTest2.txt:

hadoop fs -copyToLocal /user/hduser/hdfsTest.txt hdfsTest2.txt

ls
hdfsTest2.txt  hdfsTest.txt

To delete the file from HDFS:

hadoop fs -rm hdfsTest.txt

hadoop fs -ls

Monitor Job & Task

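NameNode daemon: http://hdnode01:50070
JobTracker daemon: http://hdnode01:50030
TaskTracker daemon: http://hdnode01:50060

Note: the JobTracker and TaskTracker daemons belong to Hadoop 1.x. On Hadoop 2.x with YARN, the ResourceManager web UI (port 8088 by default) takes their place.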

References

http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php
