<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="Apache Forrest" name="Generator">
<meta name="Forrest-version" content="0.8">
<meta name="Forrest-skin-name" content="pelt">
<title>Hadoop Archives Guide</title>
<link type="text/css" href="skin/basic.css" rel="stylesheet">
<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
<link type="text/css" href="skin/profile.css" rel="stylesheet">
<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
<link rel="shortcut icon" href="images/favicon.ico">
</head>
<body onload="init()">
<script type="text/javascript">ndeSetTextSize();</script>
<div id="top">
<!--+
|breadtrail
+-->
<div class="breadtrail">
<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://hadoop.apache.org/">Hadoop</a> &gt; <a href="http://hadoop.apache.org/core/">Core</a><script src="skin/breadcrumbs.js" language="JavaScript" type="text/javascript"></script>
</div>
<!--+
|header
+-->
<div class="header">
<!--+
|start group logo
+-->
<div class="grouplogo">
<a href="http://hadoop.apache.org/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg" title="Apache Hadoop"></a>
</div>
<!--+
|end group logo
+-->
<!--+
|start Project Logo
+-->
<div class="projectlogo">
<a href="http://hadoop.apache.org/core/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo-2.gif" title="Scalable Computing Platform"></a>
</div>
<!--+
|end Project Logo
+-->
<!--+
|start Search
+-->
<div class="searchbox">
<form action="http://www.google.com/search" method="get" class="roundtopsmall">
<input value="hadoop.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank (this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search the site with google">&nbsp;
<input name="Search" value="Search" type="submit">
</form>
</div>
<!--+
|end search
+-->
<!--+
|start Tabs
+-->
<ul id="tabs">
<li>
<a class="unselected" href="http://hadoop.apache.org/core/">Project</a>
</li>
<li>
<a class="unselected" href="http://wiki.apache.org/hadoop">Wiki</a>
</li>
<li class="current">
<a class="selected" href="index.html">Hadoop 1.1.2 Documentation</a>
</li>
</ul>
<!--+
|end Tabs
+-->
</div>
</div>
<div id="main">
<div id="publishedStrip">
<!--+
|start Subtabs
+-->
<div id="level2tabs"></div>
<!--+
|end Endtabs
+-->
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
// --></script>
</div>
<!--+
|breadtrail
+-->
<div class="breadtrail">
&nbsp;
</div>
<!--+
|start Menu, mainarea
+-->
<!--+
|start Menu
+-->
<div id="menu">
<div onclick="SwitchMenu('menu_1.1', 'skin/')" id="menu_1.1Title" class="menutitle">Getting Started</div>
<div id="menu_1.1" class="menuitemgroup">
<div class="menuitem">
<a href="index.html">Overview</a>
</div>
<div class="menuitem">
<a href="single_node_setup.html">Single Node Setup</a>
</div>
<div class="menuitem">
<a href="cluster_setup.html">Cluster Setup</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Guides</div>
<div id="menu_1.2" class="menuitemgroup">
<div class="menuitem">
<a href="HttpAuthentication.html">Authentication for Hadoop HTTP web-consoles</a>
</div>
</div>
<div onclick="SwitchMenu('menu_selected_1.3', 'skin/')" id="menu_selected_1.3Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">MapReduce</div>
<div id="menu_selected_1.3" class="selectedmenuitemgroup" style="display: block;">
<div class="menuitem">
<a href="mapred_tutorial.html">MapReduce Tutorial</a>
</div>
<div class="menuitem">
<a href="streaming.html">Hadoop Streaming</a>
</div>
<div class="menuitem">
<a href="commands_manual.html">Hadoop Commands</a>
</div>
<div class="menuitem">
<a href="distcp.html">DistCp</a>
</div>
<div class="menuitem">
<a href="vaidya.html">Vaidya</a>
</div>
<div class="menupage">
<div class="menupagetitle">Hadoop Archives</div>
</div>
<div class="menuitem">
<a href="gridmix.html">Gridmix</a>
</div>
<div class="menuitem">
<a href="rumen.html">Rumen</a>
</div>
<div class="menuitem">
<a href="capacity_scheduler.html">Capacity Scheduler</a>
</div>
<div class="menuitem">
<a href="fair_scheduler.html">Fair Scheduler</a>
</div>
<div class="menuitem">
<a href="hod_scheduler.html">Hod Scheduler</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">HDFS</div>
<div id="menu_1.4" class="menuitemgroup">
<div class="menuitem">
<a href="hdfs_user_guide.html">HDFS Users </a>
</div>
<div class="menuitem">
<a href="hdfs_design.html">HDFS Architecture</a>
</div>
<div class="menuitem">
<a href="hdfs_permissions_guide.html">Permissions</a>
</div>
<div class="menuitem">
<a href="hdfs_quota_admin_guide.html">Quotas</a>
</div>
<div class="menuitem">
<a href="SLG_user_guide.html">Synthetic Load Generator</a>
</div>
<div class="menuitem">
<a href="webhdfs.html">WebHDFS REST API</a>
</div>
<div class="menuitem">
<a href="libhdfs.html">C API libhdfs</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.5', 'skin/')" id="menu_1.5Title" class="menutitle">Common</div>
<div id="menu_1.5" class="menuitemgroup">
<div class="menuitem">
<a href="deployment_layout.html">Deployment Layout</a>
</div>
<div class="menuitem">
<a href="file_system_shell.html">File System Shell</a>
</div>
<div class="menuitem">
<a href="service_level_auth.html">Service Level Authorization</a>
</div>
<div class="menuitem">
<a href="native_libraries.html">Native Libraries</a>
</div>
</div>
<div onclick="SwitchMenu('menu_1.6', 'skin/')" id="menu_1.6Title" class="menutitle">Miscellaneous</div>
<div id="menu_1.6" class="menuitemgroup">
<div class="menuitem">
<a href="Secure_Impersonation.html">Secure Impersonation</a>
</div>
<div class="menuitem">
<a href="api/index.html">API Docs</a>
</div>
<div class="menuitem">
<a href="jdiff/changes.html">API Changes</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/hadoop/">Wiki</a>
</div>
<div class="menuitem">
<a href="http://wiki.apache.org/hadoop/FAQ">FAQ</a>
</div>
<div class="menuitem">
<a href="releasenotes.html">Release Notes</a>
</div>
<div class="menuitem">
<a href="changes.html">Change Log</a>
</div>
</div>
<div id="credit"></div>
<div id="roundbottom">
<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
<!--+
|alternative credits
+-->
<div id="credit2"></div>
</div>
<!--+
|end Menu
+-->
<!--+
|start content
+-->
<div id="content">
<div title="Portable Document Format" class="pdflink">
<a class="dida" href="hadoop_archives.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif" class="skin"><br>
PDF</a>
</div>
<h1>Hadoop Archives Guide</h1>
<div id="minitoc-area">
<ul class="minitoc">
<li>
<a href="#Overview">Overview</a>
</li>
<li>
<a href="#How+to+Create+an+Archive">How to Create an Archive</a>
</li>
<li>
<a href="#How+to+Look+Up+Files+in+Archives">How to Look Up Files in Archives</a>
</li>
<li>
<a href="#Archives+Examples">Archives Examples</a>
<ul class="minitoc">
<li>
<a href="#Creating+an+Archive">Creating an Archive</a>
</li>
<li>
<a href="#Looking+Up+Files"> Looking Up Files</a>
</li>
</ul>
</li>
<li>
<a href="#Hadoop+Archives+and+MapReduce">Hadoop Archives and MapReduce </a>
</li>
</ul>
</div>
<a name="N1000D"></a><a name="Overview"></a>
<h2 class="h3">Overview</h2>
<div class="section">
<p>
Hadoop archives are special-format archives. A Hadoop archive
maps to a file system directory and always has a *.har
extension. The archive directory contains metadata files
(_index and _masterindex) and data files (part-*). The _index file contains
the names of the files that are part of the archive and their locations
within the part files.
</p>
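<p>
As a rough illustration of this layout, the sketch below checks whether a directory looks like a Hadoop archive.
It assumes the archive directory is visible on a local file system (for example, after copying it out of HDFS),
and looks_like_har is a hypothetical helper name, not a Hadoop command.
</p>
<pre class="code">
```shell
# Sketch: succeed if a local directory has the layout described above:
# a *.har name, _index and _masterindex files, and at least one part-* file.
# looks_like_har is a hypothetical helper, not part of Hadoop.
looks_like_har() {
  dir=${1%/}
  case "$dir" in
    *.har) ;;
    *) return 1 ;;
  esac
  [ -f "$dir/_index" ] || return 1
  [ -f "$dir/_masterindex" ] || return 1
  for part in "$dir"/part-*; do
    # The glob stays unexpanded when nothing matches, so test for a real file.
    if [ -f "$part" ]; then
      return 0
    fi
  done
  return 1
}
```
</pre>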
</div>
<a name="N10017"></a><a name="How+to+Create+an+Archive"></a>
<h2 class="h3">How to Create an Archive</h2>
<div class="section">
<p>
<span class="codefrag">Usage: hadoop archive -archiveName name -p &lt;parent&gt; &lt;src&gt;* &lt;dest&gt;</span>
</p>
<p>
-archiveName is the name of the archive you want to create,
for example foo.har. The name should have a *.har extension.
The -p (parent) argument specifies the path relative to which the files are
archived. For example:
</p>
<p>
<span class="codefrag"> -p /foo/bar a/b/c e/f/g </span>
</p>
<p>
Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to the parent.
Note that archives are created by a MapReduce job, so you
need a MapReduce cluster to run this command. For a detailed example, see the later sections. </p>
<p> If you want to archive a single directory, /foo/bar, you can use: </p>
<p>
<span class="codefrag"> hadoop archive -archiveName zoo.har -p /foo/bar /outputdir </span>
</p>
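<p>
The usage line above can be sketched as a small dry-run helper that only prints the command it would run.
build_archive_cmd and its argument order are assumptions made for illustration. The helper does not contact a cluster,
and it does not handle paths that contain spaces.
</p>
<pre class="code">
```shell
# Sketch: print a "hadoop archive" command line instead of running it.
# Usage: build_archive_cmd NAME PARENT DEST [SRC...]
# build_archive_cmd is a hypothetical helper, not part of Hadoop.
build_archive_cmd() {
  [ "$#" -ge 3 ] || return 2
  name=$1
  parent=$2
  dest=$3
  shift 3
  # The archive name should carry the .har extension.
  case "$name" in
    *.har) ;;
    *) echo "archive name must end in .har: $name"; return 1 ;;
  esac
  # Remaining arguments are source paths relative to the parent.
  # With none given, the whole parent directory is archived.
  cmd="hadoop archive -archiveName $name -p $parent"
  for src in "$@"; do
    cmd="$cmd $src"
  done
  echo "$cmd $dest"
}
```
</pre>
<p>
For example, <span class="codefrag">build_archive_cmd zoo.har /foo/bar /outputdir</span> prints the single-directory command shown above.
</p>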
</div>
<a name="N10033"></a><a name="How+to+Look+Up+Files+in+Archives"></a>
<h2 class="h3">How to Look Up Files in Archives</h2>
<div class="section">
<p>
The archive exposes itself as a file system layer, so all the fs shell
commands work on archives, but with a different URI. Note also that
archives are immutable, so renames, deletes and creates return
an error. The URI for Hadoop Archives is:
</p>
<p>
<span class="codefrag">har://scheme-hostname:port/archivepath/fileinarchive</span>
</p>
<p>
If no scheme is provided, the underlying filesystem is assumed.
In that case the URI looks like: </p>
<p>
<span class="codefrag">har:///archivepath/fileinarchive</span>
</p>
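<p>
The two URI forms above can be assembled mechanically. The sketch below is illustrative only: har_uri is a
hypothetical helper, and the host name and port used in the usage note are placeholders rather than values from this guide.
</p>
<pre class="code">
```shell
# Sketch: build a har:// URI for a file inside an archive.
# Usage: har_uri ARCHIVE_PATH FILE_IN_ARCHIVE [SCHEME HOST PORT]
# har_uri is a hypothetical helper, not part of Hadoop.
har_uri() {
  archive=$1
  file=$2
  if [ "$#" -ge 5 ]; then
    # Explicit form: har://scheme-hostname:port/archivepath/fileinarchive
    echo "har://$3-$4:$5$archive/$file"
  else
    # No scheme given: the underlying filesystem is assumed.
    echo "har://$archive/$file"
  fi
}
```
</pre>
<p>
With an absolute archive path, <span class="codefrag">har_uri /user/zoo/foo.har dir1</span> yields the short form, while
<span class="codefrag">har_uri /user/zoo/foo.har dir1 hdfs namenode 8020</span> yields the explicit form for a placeholder host and port.
</p>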
</div>
<a name="N10046"></a><a name="Archives+Examples"></a>
<h2 class="h3">Archives Examples</h2>
<div class="section">
<a name="N1004C"></a><a name="Creating+an+Archive"></a>
<h3 class="h4">Creating an Archive</h3>
<p>
<span class="codefrag">hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </span>
</p>
<p>
The above example creates an archive using /user/hadoop as the relative archive directory.
The directories /user/hadoop/dir1 and /user/hadoop/dir2 are
archived in the file system directory /user/zoo/foo.har. Archiving does not delete the input
files. If you want to delete the input files after creating the archive (to reduce namespace usage), you
must do so yourself.
</p>
<a name="N1005A"></a><a name="Looking+Up+Files"></a>
<h3 class="h4"> Looking Up Files</h3>
<p> Looking up files in Hadoop archives is as easy as doing an ls on the filesystem. After you have
archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, you can list all
the files in the archive by running: </p>
<p>
<span class="codefrag">hadoop dfs -lsr har:///user/zoo/foo.har/</span>
</p>
<p> To understand the significance of the -p argument, let's go through the above example again. If you do
an ls (not lsr) on the Hadoop archive using </p>
<p>
<span class="codefrag">hadoop dfs -ls har:///user/zoo/foo.har</span>
</p>
<p>The output should be:</p>
<pre class="code">
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
</pre>
<p> Recall that the archive was created with the following command: </p>
<p>
<span class="codefrag">hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo </span>
</p>
<p> If we were to change the command to: </p>
<p>
<span class="codefrag">hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo </span>
</p>
<p> then an ls on the Hadoop archive using </p>
<p>
<span class="codefrag">hadoop dfs -ls har:///user/zoo/foo.har</span>
</p>
<p>would give you:</p>
<pre class="code">
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
</pre>
<p>
Notice that the archived files have been archived relative to /user/ rather than /user/hadoop.
</p>
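<p>
The effect of -p can be modelled as plain string manipulation: the parent prefix is dropped from each source path, and
the remainder is what appears under the archive. The sketch below is an illustration under that assumption. archived_uri is a
hypothetical helper and does not query an actual archive.
</p>
<pre class="code">
```shell
# Sketch: where a source directory ends up inside the archive for a given -p parent.
# Usage: archived_uri PARENT ABSOLUTE_SRC DEST NAME
# archived_uri is a hypothetical helper, not part of Hadoop.
archived_uri() {
  parent=${1%/}
  src=$2
  dest=${3%/}
  name=$4
  # The source must live under the parent path.
  case "$src" in
    "$parent"/*) ;;
    *) return 1 ;;
  esac
  # Drop the parent prefix; the remainder is the path inside the archive.
  rel=${src#"$parent"/}
  echo "har://$dest/$name/$rel"
}
```
</pre>
<p>
With parent /user/hadoop, the source /user/hadoop/dir1 maps to har:///user/zoo/foo.har/dir1. With parent /user/, the same
source maps to har:///user/zoo/foo.har/hadoop/dir1, matching the two listings above.
</p>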
</div>
<a name="N10096"></a><a name="Hadoop+Archives+and+MapReduce"></a>
<h2 class="h3">Hadoop Archives and MapReduce </h2>
<div class="section">
<p>Using Hadoop Archives in MapReduce is as easy as specifying an input filesystem other than the default file system.
If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all
you need to do is specify the input directory as har:///user/zoo/foo.har. Since a Hadoop archive is exposed as a file system,
MapReduce can use all the logical input files in the archive as input.</p>
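<p>
As a hedged illustration, the dry-run sketch below prints a job submission that reads its input from an archive.
It assumes the wordcount program from the Hadoop examples jar. The jar file name and the output directory are placeholders,
and wordcount_on_har is a hypothetical helper.
</p>
<pre class="code">
```shell
# Sketch: print (not run) a job submission whose input is a Hadoop archive.
# Usage: wordcount_on_har EXAMPLES_JAR ARCHIVE_PATH OUTPUT_DIR
# wordcount_on_har is a hypothetical helper; the jar name passed in is an assumption.
wordcount_on_har() {
  # Only the input path changes: it uses the har:// scheme instead of a plain path.
  echo "hadoop jar $1 wordcount har://$2 $3"
}
```
</pre>
<p>
For example, <span class="codefrag">wordcount_on_har hadoop-examples-1.1.2.jar /user/zoo/foo.har /user/zoo/wc-out</span> prints a command
whose input directory is har:///user/zoo/foo.har.
</p>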
</div>
</div>
<!--+
|end content
+-->
<div class="clearboth">&nbsp;</div>
</div>
<div id="footer">
<!--+
|start bottomstrip
+-->
<div class="lastmodified">
<script type="text/javascript"><!--
document.write("Last Published: " + document.lastModified);
// --></script>
</div>
<div class="copyright">
Copyright &copy;
2008 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
</div>
<!--+
|end bottomstrip
+-->
</div>
</body>
</html>