
Setting Up A Data Transfer with Sqoop

According to its page on the Apache website, Sqoop is "a tool designed for efficiently transferring data between Apache Hadoop and structured datastores such as relational databases." The process is largely automated, because the database schema tells Sqoop how the data should be imported, as described in the user guide. In this post, we're going to import data from a MySQL database and then transfer it to Hive.

Setting Up the MySQL Data

The first step is to set up the MySQL database and table that will be used for the transfer. First, we'll create a database called "sqoop_test":

mysql> CREATE DATABASE sqoop_test;
mysql> USE sqoop_test;

Next, we'll create the "stocks" table, which will be populated from a text file of stock data:

mysql> CREATE TABLE stocks (
    ->     id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ->     symbol VARCHAR(100),
    ->     quote_date VARCHAR(100),
    ->     open_price DOUBLE PRECISION,
    ->     high_price DOUBLE PRECISION, ...
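Once the table is populated, the transfer itself is a single command. Here is a minimal sketch of the import step, assuming MySQL is running on localhost and using placeholder credentials; the --hive-import flag tells Sqoop to load the rows into a Hive table after pulling them out of MySQL:

sqoop import \
    --connect jdbc:mysql://localhost/sqoop_test \
    --username someuser -P \
    --table stocks \
    --hive-import \
    --hive-table stocks \
    --num-mappers 1

The --num-mappers 1 option keeps the import to a single map task, which is reasonable for a small table; for larger tables, Sqoop parallelizes the import by splitting the rows on the primary key.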

The Art of the Hive Query, Part 2

The previous post covered how to create a database and tables, along with the data types one commonly sees in Hive. In this post, we're going to import the San Francisco Building Permits data into Hive and then run some queries on it. Hive can import data from a variety of sources, but arguably the two most common are HDFS and the local filesystem. To transfer files from the local filesystem to HDFS, use the "put" command, like so:

> hdfs dfs -put /path/to/file/2011_Population_Estimates.csv /user/data

To load the data into the Hive table, we use the LOAD DATA INPATH statement:

hive> LOAD DATA INPATH '/user/data/2011_Population_Estimates.csv'
    > OVERWRITE INTO TABLE 2011_pop_est;
Loading data to table datasets.2011_pop_est
Table datasets.2011_pop_est stats: [numFiles=1, numRows=0, totalSize=2259, rawDataSize=0]
OK
Time taken: 0.2
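Once the data is in a table, querying it looks much like plain SQL. As a sketch of the kind of query this post builds toward (the table and column names here are hypothetical stand-ins for the actual building permits schema):

hive> SELECT permit_type, COUNT(*) AS permit_count
    > FROM building_permits
    > GROUP BY permit_type
    > ORDER BY permit_count DESC
    > LIMIT 10;

This would report the ten most common permit types in the dataset.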

The Art of the Hive Query, Part 1

Hive is a data warehousing framework that uses SQL-esque queries to analyze data via MapReduce. Hive's query language is referred to as "HiveQL" and is not fully SQL-92 compliant: among other things, it has only basic support for indexes, but it adds features such as multi-table inserts. Nevertheless, anyone familiar with the various ways of running SQL queries should have no problem adjusting to Apache Hive. There are a number of ways to analyze and query big data with Hive - it can even be used for cleaning data! Although you could install Hadoop and Hive locally (if you have Linux on your system), more realistically you'll probably want to use a VM (virtual machine) from Cloudera or Hortonworks.

Creating a Database

Creating a database in Hive is straightforward, although a "database" in this context is really just a catalog or namespace of tables. If you don't create a database, any tables you create will go into the "default" database. Simply use the CREATE DATABASE statement, as in the sketch below.
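A minimal sketch of that step (the database name "datasets" is a placeholder):

hive> CREATE DATABASE IF NOT EXISTS datasets;
hive> USE datasets;

The IF NOT EXISTS clause keeps the statement from failing if the database already exists, and USE makes it the working database for subsequent statements.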