Posts

Showing posts from April, 2023

Streaming Data Project

Streaming isn't just for Netflix aficionados! This project focuses on streaming data, using tools like Apache Kafka and ksqlDB. The data source we will use for Extraction is a Python script that generates random text. This will be ingested by Kafka, and transformations will be done in ksqlDB. Further transformation and loading of the data will be handled in a future post.

Let's Start With Some Text!

As mentioned, this process starts with a Python script that generates oodles of random sentences. This functionality is brought to us by the module "Faker," which is described as "a Python package that generates fake data for you." We're also going to use a Kafka connector for Python, kafka-python, which will help us generate a data stream and shove it into Kafka's hungry maw. Once they're installed via pip (or pip3, as some would have it), the import statements are straightforward: from kafka import KafkaProducer and from faker import Faker.
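Putting those pieces together, a minimal sketch of the producer might look like the following. The broker address ("localhost:9092") and the topic name ("sentences") are assumptions for illustration, not values from the original post:

    from kafka import KafkaProducer
    from faker import Faker

    # Connect to a local Kafka broker (address is an assumption for this sketch).
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: v.encode("utf-8"),
    )
    fake = Faker()

    # Stream a batch of random sentences into the hypothetical "sentences" topic.
    for _ in range(10):
        producer.send("sentences", fake.sentence())

    producer.flush()  # ensure buffered messages are actually delivered

In a real stream this loop would run continuously; ten messages just keep the sketch quick to test.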

Real Estate Data Pipeline, Part 2

In our last data adventure, the E and the T of our ETL process were accomplished by ingesting real estate data via web scraping with Python; then, some transformations were done using Apache Spark to clean the data and format it as needed. Next, we're going to load it into AWS Redshift and do some visualizations in Power BI.

Making a connection to AWS Redshift via Python can be a daunting task at first, but there are modules like AWS Data Wrangler (now known as AWS SDK for pandas) that can simplify the process somewhat. AWS SDK for pandas, in its own words, is a "python initiative that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services," and it boasts "easy integration with Athena, Glue, Redshift, Timestream, OpenSearch, Neptune, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL)." But, this isn't an AWS Wrangler
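As a rough sketch of what that connect-and-load step can look like with AWS SDK for pandas, consider the snippet below. The Glue connection name, schema, table, and sample columns are hypothetical placeholders, not values from the post:

    import awswrangler as wr
    import pandas as pd

    # Connect via a Glue catalog connection (the name is a placeholder).
    con = wr.redshift.connect("my-redshift-connection")

    # A tiny stand-in for the cleaned real estate DataFrame from Part 1.
    df = pd.DataFrame({"address": ["123 Main St"], "price": [350000]})

    # Small loads can go straight through to_sql; larger ones should prefer
    # wr.redshift.copy, which stages the data in S3 first.
    wr.redshift.to_sql(df=df, con=con, table="listings", schema="public", mode="append")

    con.close()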