Posts

Imperial to Metric Conversion (and vice-versa) Script

An idea for a project that I had was a Python script that would use functions to convert Imperial values to Metric and also Metric values to Imperial. This is useful here in Canada, where we will often measure temperature in Celsius and distance in kilometers, but then also measure height in feet and weight in pounds! It can all get a bit much to keep track of sometimes.

I wanted to get more familiar with classes in Python, so I started with a master 'Converter' class that would handle the math, and then several local classes for each kind of unit being converted. This is the main class and the units being utilized:

class Converter:
    """
    A class to convert values between different units of measurement.
    Supports temperature, length, volume, and weight conversions.
    """
    def __init__(self):
        """
        Initializes the Converter class with dicti
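To make the idea concrete, here is a minimal, runnable sketch of what such a converter class might look like. The conversion factors and method names below are illustrative assumptions, not the script's actual implementation:

# A minimal sketch of a unit-conversion class. Lengths and weights use
# multiplicative factors; temperature is handled separately because it
# involves an offset, not just a scaling factor.
class Converter:
    def __init__(self):
        # Factors convert FROM the imperial unit TO the metric one.
        self.factors = {
            "miles_to_km": 1.609344,
            "feet_to_m": 0.3048,
            "pounds_to_kg": 0.45359237,
        }

    def miles_to_km(self, miles):
        return miles * self.factors["miles_to_km"]

    def km_to_miles(self, km):
        return km / self.factors["miles_to_km"]

    def fahrenheit_to_celsius(self, f):
        return (f - 32) * 5 / 9

    def celsius_to_fahrenheit(self, c):
        return c * 9 / 5 + 32

if __name__ == "__main__":
    conv = Converter()
    print(conv.miles_to_km(10))              # roughly 16.09
    print(conv.fahrenheit_to_celsius(98.6))  # 37.0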

The Basics of IICS

Informatica Intelligent Cloud Services (IICS) is Informatica's cloud-based offering for ETL and data integration. It's an iPaaS solution. According to Informatica's website, it "unifies existing Informatica cloud service offerings and expands into a full suite of cloud data management services over time."

There are four primary services that you will be greeted with when using IICS: Data Integration, Administration, Monitoring, and Operational Insights.

Fortunately, Informatica has a 30-day demo version, so I decided to dive in and check it out! Although most of the program is web-based, it's still necessary to install a secure agent locally if you want to get any of the connections, mappings, or other functions working properly. In this case, I decided to install it in Windows using the "agent64_install_ng_ext.6504.exe" executable. A word of caution: this only seems to work properly when installing to the default location, so hopefully you weren't set

Streaming Data Project

Streaming isn't just for Netflix aficionados! This project will focus on streaming data, using tools like Apache Kafka and ksqlDB. The data source that we will use for Extraction will be a Python script that generates random text. This will be ingested by Kafka, and transformations will be done in ksqlDB. Further transformation and loading of the data will be handled in a future post.

Let's Start With Some Text!

As mentioned, this process will start with a Python script that will generate oodles of random sentences. This functionality is brought to us by the module "Faker," which is described as "a Python package that generates fake data for you." We're also going to use a Kafka client for Python, kafka-python, which will help us generate a data stream and shove it into Kafka's hungry maw. Once they're installed via pip (or pip3, as some would have it), the import statements are straightforward:

from kafka import KafkaProducer
from faker import Faker
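Here is a minimal sketch of how that producer loop might look. The topic name ("fake_text") and the broker address are assumptions for illustration, not necessarily the ones used in the project:

# A minimal Faker -> Kafka producer sketch using kafka-python.
import json
import time

from faker import Faker
from kafka import KafkaProducer

fake = Faker()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Send one fake sentence per second to a hypothetical "fake_text" topic.
while True:
    message = {"sentence": fake.sentence()}
    producer.send("fake_text", value=message)
    producer.flush()
    time.sleep(1)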

Real Estate Data Pipeline, Part 2

In our last data adventure, the E and the T of our ETL process were accomplished by ingesting real estate data via web scraping with Python; then, some transformations were done using Apache Spark to clean the data and format it as needed. Next, we're going to be loading it into AWS Redshift and doing some visualizations in Power BI.

Making a connection to AWS Redshift via Python can be a daunting task at first, but there are modules like AWS Data Wrangler (now known as AWS SDK for pandas) that can simplify the process somewhat. AWS SDK for pandas, in their own words, is a "python initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services." And they boast "easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL)." But, this isn't an AWS Wrangler
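As a rough illustration of the loading step, here is a minimal sketch using awswrangler to push a DataFrame into Redshift. The Glue connection name, schema, table, and sample columns are placeholders, and the exact connection handling may differ from what the post ends up using:

# A minimal awswrangler (AWS SDK for pandas) -> Redshift sketch.
import awswrangler as wr
import pandas as pd

# Placeholder data standing in for the cleaned real estate DataFrame.
df = pd.DataFrame(
    {"address": ["123 Example St"], "price": [899000], "bedrooms": [3]}
)

# Connect through a pre-configured AWS Glue catalog connection
# (Secrets Manager is another option for credentials).
con = wr.redshift.connect("my-redshift-glue-connection")

try:
    wr.redshift.to_sql(
        df=df,
        con=con,
        schema="public",
        table="toronto_listings",  # hypothetical target table
        mode="overwrite",
    )
finally:
    con.close()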

Real Estate Data Pipeline, Part 1

This will be the first of two posts documenting the process of creating a data pipeline using real estate data. Overall, the whole adventure will include:

- Web-scraping Toronto real estate data
- Cleaning and transforming everything in Spark
- Exporting and loading the data into AWS Redshift
- Creating a visualization that puts it all into graphical form

This will be a post dealing not only with the creation of the pipeline, but also with things that were learned along the way.

Extract - Web-Scraping with Python

This was my first foray into web-scraping and, using BeautifulSoup, the process was a lot easier than I had anticipated. The web-scraping code can be found here. We begin with our import statements:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import time

BeautifulSoup, as mentioned, is one of the standard web-scraping libraries, but there are also options like Selenium, Scrapy, and lxml. The re module will provide us with regular expression pattern-mat
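As a quick illustration of the scraping pattern, here is a minimal requests + BeautifulSoup sketch. The URL and the class names in the selectors are placeholders, since the real listing site has its own markup (and scraping policies worth checking first):

# A minimal requests + BeautifulSoup scraping sketch with placeholder selectors.
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://example.com/toronto-listings"  # placeholder URL
response = requests.get(url, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for card in soup.find_all("div", class_="listing-card"):  # assumed class name
    price = card.find("span", class_="price")
    address = card.find("span", class_="address")
    rows.append({
        "price": price.get_text(strip=True) if price else None,
        "address": address.get_text(strip=True) if address else None,
    })

df = pd.DataFrame(rows)
print(df.head())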

The Wizard of Oozie

As per the official website, Apache Oozie is a "workflow scheduler system to manage Apache Hadoop jobs." It automates the running of Hadoop jobs through the use of a workflow engine and a coordinator engine. Oozie was made to work with other common Hadoop tools such as Pig, Hive, and Sqoop, but it can also be extended to support custom Hadoop jobs.

We'll start by creating a database in MySQL called "oozie". For now, we won't create a table or populate it with data, since the jobs in the workflow will take care of that:

mysql> create database oozie;
mysql> use oozie;

What the Workflow is All About

The information used for this table will come from a file called business.csv, and it will later be copied to a folder in HDFS, where it can be used to import into Hive. There are a number of files that will make up the processes in this job. There is one file that will retrieve the csv file from data.sfgov.org; there is another file that will
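For orientation, here is a minimal sketch of what an Oozie workflow definition (workflow.xml) generally looks like, with a single shell action. The workflow, action, and script names are hypothetical, not the actual files used in this job:

<workflow-app name="business-data-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="fetch-csv"/>

    <!-- Hypothetical shell action that would download business.csv -->
    <action name="fetch-csv">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>get_business_csv.sh</exec>
            <file>get_business_csv.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>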