Extracting metadata from Apache Hive for migration
This document shows how you can use the dwh-migration-dumper tool to extract the necessary metadata before running an Apache Hive data or permissions migration.
This document covers metadata extraction from the following data sources:
- Apache Hive
- Apache Hadoop Distributed File System (HDFS)
- Apache Ranger
- Cloudera Manager
- Apache Hive query logs
Before you begin
Before you can use the dwh-migration-dumper tool, do the following:
Install Java
The server on which you plan to run the dwh-migration-dumper tool must have Java 8 or higher installed. If it doesn't, download Java from the Java downloads page and install it.
Required permissions
The user account that you specify for connecting the dwh-migration-dumper tool to the source system must have permissions to read metadata from that system. Confirm that this account has appropriate role membership to query the metadata resources available for your platform. For example, INFORMATION_SCHEMA is a metadata resource that is common across several platforms.
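One way to sanity-check these permissions is to connect as the extraction account and list the metadata you expect to read. The following sketch assumes a HiveServer2 endpoint; the hostname, username, and password are placeholder values that you should replace with your own:

# Hypothetical smoke test: confirm the extraction account can read Hive metadata.
# The endpoint, username, and password are placeholder values.
beeline -u "jdbc:hive2://hive.example.com:10000/default" \
  -n dumper_user -p 'CHANGE-ME' \
  -e "SHOW DATABASES;"

If the command lists your databases, the account can read metadata from the metastore.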
Install the dwh-migration-dumper tool
To install the dwh-migration-dumper tool, follow these steps:
1. On the machine where you want to run the dwh-migration-dumper tool, download the zip file from the dwh-migration-dumper tool GitHub repository.
2. To validate the dwh-migration-dumper tool zip file, download the SHA256SUMS.txt file and run the following command:

   Bash

   sha256sum --check SHA256SUMS.txt

   If verification fails, see Troubleshooting.

   Windows PowerShell

   (Get-FileHash RELEASE_ZIP_FILENAME).Hash -eq ((Get-Content SHA256SUMS.txt) -Split " ")[0]

   Replace RELEASE_ZIP_FILENAME with the downloaded zip filename of the dwh-migration-dumper command-line extraction tool release. For example, dwh-migration-tools-v1.0.52.zip.

   A True result confirms successful checksum verification. A False result indicates a verification error. Make sure the checksum and zip files are downloaded from the same release version and placed in the same directory.
3. Extract the zip file. The extraction tool binary is in the /bin subdirectory of the folder created by extracting the zip file.
4. Update the PATH environment variable to include the installation path for the extraction tool.
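On Linux, the end-to-end install might look like the following sketch; the release filename and the extracted folder name are assumptions, so adjust them to match your download:

# Verify the download (both files must be in the same directory).
sha256sum --check SHA256SUMS.txt
# Extract the release zip file.
unzip dwh-migration-tools-v1.0.52.zip
# Add the tool's bin subdirectory to PATH; replace EXTRACTED-FOLDER with the
# folder created by extracting the zip file.
export PATH="$PATH:$(pwd)/EXTRACTED-FOLDER/bin"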
Extracting metadata for migration
Select one of the following options to learn how to extract metadata for your data source:
Apache Hive
Run the following command to extract metadata from Apache Hive using the dwh-migration-dumper tool.
dwh-migration-dumper \
--connector hiveql \
--host HIVE-HOST \
--port HIVE-PORT \
--output gs://MIGRATION-BUCKET/hive-dumper-output.zip \
--assessment
Replace the following:
- HIVE-HOST: the hostname of the Apache Hive instance
- HIVE-PORT: the port number of the Apache Hive instance. You can skip this argument if you are using the default 9083 port.
- MIGRATION-BUCKET: the Cloud Storage bucket that you are using to store the migration files.
This command extracts metadata from Apache Hive to a file named hive-dumper-output.zip in the MIGRATION-BUCKET directory.
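For example, a filled-in invocation might look like the following; the hostname and bucket name are placeholders, not real values:

dwh-migration-dumper \
  --connector hiveql \
  --host hive-metastore.example.com \
  --port 9083 \
  --output gs://my-migration-bucket/hive-dumper-output.zip \
  --assessment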
HDFS
Run the following command to extract metadata from HDFS using the dwh-migration-dumper tool.
dwh-migration-dumper \
--connector hdfs \
--host HDFS-HOST \
--port HDFS-PORT \
--output gs://MIGRATION-BUCKET/hdfs-dumper-output.zip \
--assessment
Replace the following:
- HDFS-HOST: the HDFS NameNode hostname
- HDFS-PORT: the HDFS NameNode port number. You can skip this argument if you are using the default 8020 port.
- MIGRATION-BUCKET: the Cloud Storage bucket that you are using to store the migration files.
This command extracts metadata from HDFS to a file named hdfs-dumper-output.zip in the MIGRATION-BUCKET directory.
There are several known limitations when extracting metadata from HDFS:
- Some tasks in this connector are optional and can fail, logging a full stack trace in the output. As long as the required tasks have succeeded and the hdfs-dumper-output.zip is generated, then you can proceed with the HDFS migration.
- The extraction process might fail or run slower than expected if the configured thread pool size is too large. If you are encountering these issues, we recommend decreasing the thread pool size using the command line argument --thread-pool-size, as shown in the sketch after this list.
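For example, a minimal sketch that caps the thread pool at 8 threads; the hostname and bucket name are placeholders:

dwh-migration-dumper \
  --connector hdfs \
  --host namenode.example.com \
  --port 8020 \
  --thread-pool-size 8 \
  --output gs://my-migration-bucket/hdfs-dumper-output.zip \
  --assessment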
Apache Ranger
Run the following command to extract metadata from Apache Ranger using the dwh-migration-dumper tool.
dwh-migration-dumper \
--connector ranger \
--host RANGER-HOST \
--port 6080 \
--user RANGER-USER \
--password RANGER-PASSWORD \
--ranger-scheme RANGER-SCHEME \
--output gs://MIGRATION-BUCKET/ranger-dumper-output.zip \
--assessment
Replace the following:
- RANGER-HOST: the hostname of the Apache Ranger instance
- RANGER-USER: the username of the Apache Ranger user
- RANGER-PASSWORD: the password of the Apache Ranger user
- RANGER-SCHEME: specify if Apache Ranger is using http or https. The default value is http.
- MIGRATION-BUCKET: the Cloud Storage bucket that you are using to store the migration files.
You can also include the following optional flags:
- --kerberos-auth-for-hadoop: replaces --user and --password if Apache Ranger is protected by Kerberos instead of basic authentication. You must run the kinit command before the dwh-migration-dumper tool to use this flag.
- --ranger-disable-tls-validation: include this flag if the HTTPS certificate used by the API is self-signed. For example, when using Cloudera.
This command extracts metadata from Apache Ranger to a file named ranger-dumper-output.zip in the MIGRATION-BUCKET directory.
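For example, a sketch of an extraction against a Kerberos-protected Ranger instance; the principal, hostname, and bucket name are placeholder values:

# Obtain a Kerberos ticket before running the tool (placeholder principal).
kinit ranger-admin@EXAMPLE.COM
dwh-migration-dumper \
  --connector ranger \
  --host ranger.example.com \
  --port 6080 \
  --kerberos-auth-for-hadoop \
  --output gs://my-migration-bucket/ranger-dumper-output.zip \
  --assessment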
Cloudera
Run the following command to extract metadata from Cloudera using the dwh-migration-dumper tool.
dwh-migration-dumper \
--connector cloudera-manager \
--url CLOUDERA-URL \
--user CLOUDERA-USER \
--password CLOUDERA-PASSWORD \
--output gs://MIGRATION-BUCKET/cloudera-dumper-output.zip \
--yarn-application-types APPLICATION-TYPES \
--pagination-page-size PAGE-SIZE \
--assessment
Replace the following:
- CLOUDERA-URL: the URL for Cloudera Manager
- CLOUDERA-USER: the username of the Cloudera user
- CLOUDERA-PASSWORD: the password of the Cloudera user
- MIGRATION-BUCKET: the Cloud Storage bucket that you are using to store the migration files.
- APPLICATION-TYPES: (Optional) list of all existing application types from Hadoop YARN. For example, SPARK, MAPREDUCE.
- PAGE-SIZE: (Optional) specify how much data is fetched from third-party services, like the Hadoop YARN API. The default value is 1000, which represents 1000 entities per request.
This command extracts metadata from Cloudera to a file named cloudera-dumper-output.zip in the MIGRATION-BUCKET directory.
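For example, a sketch invocation with the optional flags filled in; the URL, credentials, and bucket name are placeholders:

dwh-migration-dumper \
  --connector cloudera-manager \
  --url https://cloudera-manager.example.com:7183 \
  --user admin \
  --password 'CHANGE-ME' \
  --output gs://my-migration-bucket/cloudera-dumper-output.zip \
  --yarn-application-types "SPARK,MAPREDUCE" \
  --pagination-page-size 500 \
  --assessment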
Apache Hive query logs
Perform the steps in the Apache Hive section of Extract metadata and query logs from your data warehouse to extract your Apache Hive query logs. You can then upload the logs to the Cloud Storage bucket that contains your migration files.
What's next
You can use the metadata files extracted from Hadoop to do the following: