<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://lms.onnocenter.or.id/wiki/index.php?action=history&amp;feed=atom&amp;title=Hadoop%3A_Python_Map_Reduce_untuk_Hadoop</id>
	<title>Hadoop: Python Map Reduce untuk Hadoop - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://lms.onnocenter.or.id/wiki/index.php?action=history&amp;feed=atom&amp;title=Hadoop%3A_Python_Map_Reduce_untuk_Hadoop"/>
	<link rel="alternate" type="text/html" href="https://lms.onnocenter.or.id/wiki/index.php?title=Hadoop:_Python_Map_Reduce_untuk_Hadoop&amp;action=history"/>
	<updated>2026-04-23T02:51:43Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://lms.onnocenter.or.id/wiki/index.php?title=Hadoop:_Python_Map_Reduce_untuk_Hadoop&amp;diff=44828&amp;oldid=prev</id>
		<title>Onnowpurbo: New page: Sumber: http://blog.matthewrathbone.com/2013/11/17/python-map-reduce-on-hadoop---a-beginners-tutorial.html   This article originally accompanied my tutorial session at the Big Data Madison...</title>
		<link rel="alternate" type="text/html" href="https://lms.onnocenter.or.id/wiki/index.php?title=Hadoop:_Python_Map_Reduce_untuk_Hadoop&amp;diff=44828&amp;oldid=prev"/>
		<updated>2015-11-06T03:07:50Z</updated>

		<summary type="html">&lt;p&gt;New page: Sumber: http://blog.matthewrathbone.com/2013/11/17/python-map-reduce-on-hadoop---a-beginners-tutorial.html   This article originally accompanied my tutorial session at the Big Data Madison...&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Sumber: http://blog.matthewrathbone.com/2013/11/17/python-map-reduce-on-hadoop---a-beginners-tutorial.html&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This article originally accompanied my tutorial session at the Big Data Madison Meetup, November 2013.&lt;br /&gt;
&lt;br /&gt;
The goal of this article is to:&lt;br /&gt;
&lt;br /&gt;
* introduce you to the Hadoop Streaming library (the mechanism that lets us run non-JVM code on Hadoop)&lt;br /&gt;
* teach you how to write a simple MapReduce pipeline in Python (single input, single output)&lt;br /&gt;
* teach you how to write a more complex pipeline in Python (multiple inputs, single output)&lt;br /&gt;
&lt;br /&gt;
There are other good resources online about Hadoop Streaming, so I’m going over old ground a little. Here are some good links:&lt;br /&gt;
&lt;br /&gt;
* Hadoop Streaming official Documentation&lt;br /&gt;
* Michael Knoll’s Python Streaming Tutorial&lt;br /&gt;
* An Amazon EMR Python streaming tutorial&lt;br /&gt;
&lt;br /&gt;
If you are new to Hadoop, you might want to check out my beginners guide to Hadoop before digging into any code (it’s a quick read, I promise!).&lt;br /&gt;
&lt;br /&gt;
==Setup==&lt;br /&gt;
&lt;br /&gt;
I’m going to use the Cloudera Quickstart VM to run these examples.&lt;br /&gt;
&lt;br /&gt;
Once you’re booted into the quickstart VM, we’re going to get our dataset. I’m going to use the play-by-play NFL data by Brian Burke. To start with, we’re only going to use the data in his Git repository.&lt;br /&gt;
&lt;br /&gt;
Once you’re in the Cloudera VM, clone the repo:&lt;br /&gt;
&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone https://github.com/eljefe6a/nfldata.git&lt;br /&gt;
&lt;br /&gt;
To start, we’re going to use stadiums.csv. However, this data was encoded on Windows (grr), so it has ^M line separators instead of newlines (\n). We need to fix the line endings before we can play with it:&lt;br /&gt;
&lt;br /&gt;
 cd workspace/nfldata&lt;br /&gt;
 cat stadiums.csv # BAH! Everything is a single line&lt;br /&gt;
 dos2unix -l -n stadiums.csv unixstadiums.csv&lt;br /&gt;
 cat unixstadiums.csv # Hooray! One stadium per line &lt;br /&gt;
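If dos2unix isn’t available, the same normalization can be done in a few lines of Python. This is a hedged sketch, not part of the original tutorial; the file names simply mirror the ones used above.

```python
# Normalize carriage-return line endings (^M, i.e. "\r" or "\r\n")
# to Unix newlines ("\n"), as dos2unix does.
def to_unix_newlines(text: str) -> str:
    """Replace \r\n and bare \r with \n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

# Small demo string standing in for the stadiums file:
sample = "StadiumA,76125\rStadiumB,71608\r"
print(to_unix_newlines(sample))
```

Reading the whole file into memory is fine here, since stadiums.csv is tiny (32 records).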
&lt;br /&gt;
==Hadoop Streaming Intro==&lt;br /&gt;
&lt;br /&gt;
The way you ordinarily run a MapReduce job is to write a Java program with at least three parts:&lt;br /&gt;
&lt;br /&gt;
    A main method which configures the job and launches it&lt;br /&gt;
        set # reducers&lt;br /&gt;
        set mapper and reducer classes&lt;br /&gt;
        set partitioner&lt;br /&gt;
        set other hadoop configurations&lt;br /&gt;
    A Mapper Class&lt;br /&gt;
        takes K,V inputs, writes K,V outputs&lt;br /&gt;
    A Reducer Class&lt;br /&gt;
        takes K, Iterator[V] inputs, and writes K,V outputs&lt;br /&gt;
&lt;br /&gt;
Hadoop Streaming is actually just a Java library that implements these things, but instead of doing the work itself, it pipes the data to scripts. In doing so, it provides an API for other languages:&lt;br /&gt;
&lt;br /&gt;
    read from STDIN&lt;br /&gt;
    write to STDOUT&lt;br /&gt;
&lt;br /&gt;
Streaming has some (configurable) conventions that allow it to understand the data returned. Most importantly, it assumes that keys and values are separated by a \t. This is important for the rest of the MapReduce pipeline to work properly (partitioning and sorting). To understand why, check out my intro to Hadoop, where I discuss the pipeline in detail.&lt;br /&gt;
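The tab-separated convention can be simulated with plain Python, no Hadoop required. This sketch (not from the original tutorial) shows a mapper-style "key\tvalue" stream and how sorting groups identical keys together, just as the shuffle phase does:

```python
# Mapper-style output lines: "key<TAB>value".
lines = ["TRUE\t1", "FALSE\t1", "TRUE\t1"]

# Sorting the lines groups equal keys into contiguous runs,
# mimicking what Hadoop's partition/sort step delivers to a reducer.
for line in sorted(lines):
    key, value = line.split("\t", 1)
    print(key, value)
```

This is exactly why the reducer (shown later) only has to watch for the key *changing* rather than collecting values per key.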
&lt;br /&gt;
==Running a Basic Streaming Job==&lt;br /&gt;
&lt;br /&gt;
It’s just like running a normal MapReduce job, except that you need to provide some information about which scripts you want to use.&lt;br /&gt;
&lt;br /&gt;
Hadoop comes with the streaming jar in its lib directory, so just find that to use it. The job below counts the number of lines in our stadiums file. (This is really overkill, because there are only 32 records.)&lt;br /&gt;
&lt;br /&gt;
 hadoop fs -mkdir nfldata/stadiums&lt;br /&gt;
 hadoop fs -put ~/workspace/nfldata/unixstadiums.csv  nfldata/stadiums/&lt;br /&gt;
 &lt;br /&gt;
 hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \&lt;br /&gt;
     -Dmapred.reduce.tasks=1 \&lt;br /&gt;
     -input nfldata/stadiums \&lt;br /&gt;
     -output nfldata/output1 \&lt;br /&gt;
     -mapper cat \&lt;br /&gt;
     -reducer &amp;quot;wc -l&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 # now we check our results:&lt;br /&gt;
 hadoop fs -ls nfldata/output1&lt;br /&gt;
 &lt;br /&gt;
 # looks like files are there, lets get the result:&lt;br /&gt;
 hadoop fs -text nfldata/output1/part*&lt;br /&gt;
 # =&amp;gt; 32&lt;br /&gt;
&lt;br /&gt;
A good way to make sure your job has run properly is to look at the jobtracker dashboard. In the quickstart VM there is a link in the bookmarks bar.&lt;br /&gt;
&lt;br /&gt;
You should see your job in the running/completed sections; clicking on it brings up a bunch of information. The most useful data on this page is under the Map-Reduce Framework section; in particular, look for stuff like:&lt;br /&gt;
&lt;br /&gt;
    Map Input Records&lt;br /&gt;
    Map Output Records&lt;br /&gt;
    Reduce Output Records&lt;br /&gt;
&lt;br /&gt;
In our example, there are 32 input records and 1 output record:&lt;br /&gt;
&lt;br /&gt;
(Screenshot: the jobtracker dashboard.)&lt;br /&gt;
&lt;br /&gt;
==A Simple Example in Python==&lt;br /&gt;
&lt;br /&gt;
Looking in columns.txt we can see that the stadium file has the following fields:&lt;br /&gt;
&lt;br /&gt;
 Stadium (String) - The name of the stadium&lt;br /&gt;
 Capacity (Int) - The capacity of the stadium&lt;br /&gt;
 ExpandedCapacity (Int) - The expanded capacity of the stadium&lt;br /&gt;
 Location (String) - The location of the stadium&lt;br /&gt;
 PlayingSurface (String) - The type of grass, etc that the stadium has&lt;br /&gt;
 IsArtificial (Boolean) - Is the playing surface artificial&lt;br /&gt;
 Team (String) - The name of the team that plays at the stadium&lt;br /&gt;
 Opened (Int) - The year the stadium opened&lt;br /&gt;
 WeatherStation (String) - The name of the weather station closest to the stadium&lt;br /&gt;
 RoofType (Possible Values:None,Retractable,Dome) - The type of roof in the stadium&lt;br /&gt;
 Elevation - The elevation of the stadium&lt;br /&gt;
&lt;br /&gt;
Let’s use MapReduce to find the number of stadiums with artificial and natural playing surfaces.&lt;br /&gt;
&lt;br /&gt;
The pseudo-code looks like this:&lt;br /&gt;
&lt;br /&gt;
 def map(line):&lt;br /&gt;
     fields = line.split(&amp;quot;,&amp;quot;)&lt;br /&gt;
     print(fields.isArtificial, 1)&lt;br /&gt;
 &lt;br /&gt;
 def reduce(isArtificial, totals):&lt;br /&gt;
     print(isArtificial, sum(totals))&lt;br /&gt;
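The pseudo-code above can be run locally as plain Python to sanity-check the logic before touching a cluster. This is a hedged sketch of mine, not the tutorial's finished code; the sample rows are made up, and the IsArtificial column index (5) follows the field list above.

```python
from collections import defaultdict

def map_line(line):
    """Map step: emit (isArtificial, 1) for each CSV row."""
    fields = line.split(",")
    yield fields[5], 1  # IsArtificial is the 6th field (index 5)

def reduce_all(pairs):
    """Reduce step: sum the 1s per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Made-up rows following the column layout listed above:
rows = [
    "StadiumA,60000,65000,CityA,Grass,FALSE,TeamA,1990,WS1,None,100",
    "StadiumB,70000,72000,CityB,FieldTurf,TRUE,TeamB,2002,WS2,Dome,50",
    "StadiumC,65000,68000,CityC,Grass,FALSE,TeamC,1975,WS3,None,20",
]
pairs = [kv for row in rows for kv in map_line(row)]
print(reduce_all(pairs))  # {'FALSE': 2, 'TRUE': 1}
```

Note that this local simulation hands the reducer *all* pairs at once; as the gotcha below explains, the real streaming reducer sees one line per value instead.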
&lt;br /&gt;
You can find the finished code in my Hadoop framework examples repository.&lt;br /&gt;
&lt;br /&gt;
==Important Gotcha!==&lt;br /&gt;
&lt;br /&gt;
The reducer interface for streaming is actually different from the one in Java. Instead of receiving reduce(k, Iterator[V]), your script is sent one line per value, key included.&lt;br /&gt;
&lt;br /&gt;
So for example, instead of receiving:&lt;br /&gt;
&lt;br /&gt;
 reduce(&amp;#039;TRUE&amp;#039;, Iterator(1, 1, 1, 1))&lt;br /&gt;
 reduce(&amp;#039;FALSE&amp;#039;, Iterator(1, 1, 1))&lt;br /&gt;
&lt;br /&gt;
your script will receive:&lt;br /&gt;
&lt;br /&gt;
 TRUE 1&lt;br /&gt;
 TRUE 1&lt;br /&gt;
 TRUE 1&lt;br /&gt;
 TRUE 1&lt;br /&gt;
 FALSE 1&lt;br /&gt;
 FALSE 1&lt;br /&gt;
 FALSE 1&lt;br /&gt;
&lt;br /&gt;
This means you have to do a little state tracking in your reducer. This will be demonstrated in the code below.&lt;br /&gt;
&lt;br /&gt;
To follow along, check out my git repository (on the virtual machine):&lt;br /&gt;
&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone https://github.com/rathboma/hadoop-framework-examples.git&lt;br /&gt;
 cd hadoop-framework-examples&lt;br /&gt;
&lt;br /&gt;
===Mapper===&lt;br /&gt;
&lt;br /&gt;
 #!/usr/bin/env python&lt;br /&gt;
 import sys&lt;br /&gt;
 &lt;br /&gt;
 for line in sys.stdin:&lt;br /&gt;
     line = line.strip()&lt;br /&gt;
     # unpack the CSV fields; we only need the IsArtificial column (turf)&lt;br /&gt;
     stadium, capacity, expanded, location, surface, turf, team, opened, weather, roof, elevation = line.split(&amp;quot;,&amp;quot;)&lt;br /&gt;
     results = [turf, &amp;quot;1&amp;quot;]&lt;br /&gt;
     print(&amp;quot;\t&amp;quot;.join(results))&lt;br /&gt;
&lt;br /&gt;
===Reducer===&lt;br /&gt;
&lt;br /&gt;
 #!/usr/bin/env python&lt;br /&gt;
 import sys&lt;br /&gt;
 &lt;br /&gt;
 # Example input (ordered by key)&lt;br /&gt;
 # FALSE 1&lt;br /&gt;
 # FALSE 1&lt;br /&gt;
 # TRUE 1&lt;br /&gt;
 # TRUE 1&lt;br /&gt;
 # UNKNOWN 1&lt;br /&gt;
 # UNKNOWN 1&lt;br /&gt;
 &lt;br /&gt;
 # keys come grouped together&lt;br /&gt;
 # so we need to keep track of state a little bit&lt;br /&gt;
 # thus when the key changes (turf), we need to reset&lt;br /&gt;
 # our counter, and write out the count we&amp;#039;ve accumulated&lt;br /&gt;
 &lt;br /&gt;
 last_turf = None&lt;br /&gt;
 turf_count = 0&lt;br /&gt;
 &lt;br /&gt;
 for line in sys.stdin:&lt;br /&gt;
 &lt;br /&gt;
     line = line.strip()&lt;br /&gt;
     turf, count = line.split(&amp;quot;\t&amp;quot;)  &lt;br /&gt;
 &lt;br /&gt;
     count = int(count)&lt;br /&gt;
     # if this is the first iteration&lt;br /&gt;
     if not last_turf:&lt;br /&gt;
         last_turf = turf&lt;br /&gt;
 &lt;br /&gt;
     # if they&amp;#039;re the same, log it&lt;br /&gt;
     if turf == last_turf:&lt;br /&gt;
         turf_count += count&lt;br /&gt;
     else:&lt;br /&gt;
         # state change (previous line was k=x, this line is k=y)&lt;br /&gt;
         result = [last_turf, turf_count]&lt;br /&gt;
         print(&amp;quot;\t&amp;quot;.join(str(v) for v in result))&lt;br /&gt;
         last_turf = turf&lt;br /&gt;
         turf_count = 1&lt;br /&gt;
 &lt;br /&gt;
 # this is to catch the final counts after all records have been received.&lt;br /&gt;
 print(&amp;quot;\t&amp;quot;.join(str(v) for v in [last_turf, turf_count]))&lt;br /&gt;
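Since streaming guarantees that the lines a reducer sees are sorted by key, the manual state tracking above can also be expressed with itertools.groupby. This is a design sketch of mine, not the tutorial's actual reducer:

```python
from itertools import groupby

def reduce_stream(lines):
    """Sum counts per key, assuming lines arrive sorted by key
    (which Hadoop's shuffle/sort guarantees)."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for turf, group in groupby(parsed, key=lambda kv: kv[0]):
        yield turf, sum(int(count) for _, count in group)

# Demo on already-sorted lines, as a reducer would receive them
# (in a real job, iterate over sys.stdin instead):
for turf, total in reduce_stream(["FALSE\t1\n", "FALSE\t1\n", "TRUE\t1\n"]):
    print(f"{turf}\t{total}")
```

groupby only merges *adjacent* equal keys, which is exactly the contract streaming provides; it would silently miscount on unsorted input, just like the hand-rolled version.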
&lt;br /&gt;
You might notice that the reducer is significantly more complex than the pseudocode. That is because the streaming interface is limited and cannot really provide a way to implement the standard API.&lt;br /&gt;
&lt;br /&gt;
As noted, each line read contains both the key and the value, so it’s up to our reducer to keep track of key changes and act accordingly.&lt;br /&gt;
&lt;br /&gt;
Don’t forget to make your scripts executable:&lt;br /&gt;
&lt;br /&gt;
 chmod +x simple/mapper.py&lt;br /&gt;
 chmod +x simple/reducer.py&lt;br /&gt;
&lt;br /&gt;
==Testing==&lt;br /&gt;
&lt;br /&gt;
Because our example is so simple, we can actually test it without using Hadoop at all.&lt;br /&gt;
&lt;br /&gt;
 cd streaming-python&lt;br /&gt;
 cat ~/workspace/nfldata/unixstadiums.csv | simple/mapper.py | sort | simple/reducer.py&lt;br /&gt;
 # FALSE 15&lt;br /&gt;
 # TRUE 17&lt;br /&gt;
&lt;br /&gt;
Looking good so far!&lt;br /&gt;
&lt;br /&gt;
Running with Hadoop should produce the same output.&lt;br /&gt;
&lt;br /&gt;
 hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.4.0.jar \&lt;br /&gt;
     -mapper mapper.py \&lt;br /&gt;
     -reducer reducer.py \&lt;br /&gt;
     -input nfldata/stadiums \&lt;br /&gt;
     -output nfldata/pythonoutput \&lt;br /&gt;
     -file simple/mapper.py \&lt;br /&gt;
     -file simple/reducer.py&lt;br /&gt;
 # ...twiddle thumbs for a while&lt;br /&gt;
&lt;br /&gt;
 hadoop fs -text nfldata/pythonoutput/part-*&lt;br /&gt;
 FALSE 15&lt;br /&gt;
 TRUE 17&lt;br /&gt;
&lt;br /&gt;
==A Complex Example in Python==&lt;br /&gt;
&lt;br /&gt;
Check out my Real World Hadoop Guide for Python to see how to join two datasets together using python.&lt;br /&gt;
&lt;br /&gt;
==Referensi==&lt;br /&gt;
&lt;br /&gt;
* http://blog.matthewrathbone.com/2013/11/17/python-map-reduce-on-hadoop---a-beginners-tutorial.html&lt;/div&gt;</summary>
		<author><name>Onnowpurbo</name></author>
	</entry>
</feed>