
How to do CopyMerge in Hadoop 3.0?

I know that FileUtil in Hadoop 2.7 has a copyMerge function that merges multiple files into a new one.

However, according to the API documentation, copyMerge is no longer available in version 3.0.

Any ideas on how to merge all files within a directory into a single new file in Hadoop 3.0?


Answer

The FileUtil#copyMerge method was removed in Hadoop 3.0. See the following issues for details of the change:

https://issues.apache.org/jira/browse/HADOOP-12967

https://issues.apache.org/jira/browse/HADOOP-11392

You can use the getmerge shell command instead:

Usage: hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.

Examples:

hadoop fs -getmerge -nl /src /opt/output.txt
hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt

Exit Code: Returns 0 on success and non-zero on error.

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge
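
Note that getmerge writes the merged result to the local filesystem. If the merged file should stay in HDFS, one possible approach is to pipe hadoop fs -cat into hadoop fs -put, which reads from stdin when the source is given as "-". The following is a minimal sketch, not a drop-in replacement for copyMerge; the function name hdfs_merge is hypothetical, and it assumes the source directory contains only plain files (no subdirectories):

```shell
# Hypothetical helper: concatenate every file in an HDFS directory
# into a single HDFS file, without going through a local copy.
# Assumption: the directory contains only plain files.
hdfs_merge() {
    src_dir=$1
    dst_file=$2
    # The glob is quoted so that Hadoop, not the local shell, expands it.
    # "-put -" tells Hadoop to read the data to upload from stdin.
    hadoop fs -cat "${src_dir}/*" | hadoop fs -put - "${dst_file}"
}
```

It could then be called as hdfs_merge /src /dst/merged.txt. The exit status of a pipeline is that of its last command, so a failure in the -cat step may go unnoticed unless the shell is run with an option such as bash's pipefail.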

User contributions licensed under: CC BY-SA