Posts

Showing posts from March, 2023

Real Estate Data Pipeline, Part 1

This will be the first of two posts documenting the process of creating a data pipeline using real estate data. Overall, the whole adventure will include:

- Web-scraping Toronto real estate data
- Cleaning and transforming everything in Spark
- Exporting and loading the data into AWS Redshift
- Creating a visualization that puts it all into graphical form

This post deals not only with the creation of the pipeline, but also with the things that were learned along the way.

Extract - Web-Scraping with Python

This was my first foray into web-scraping and, using BeautifulSoup, the process was a lot easier than I had anticipated. The web-scraping code can be found here.

We begin with our import statements:

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    import re
    import time

BeautifulSoup, as mentioned, is one of the standard web-scraping libraries, but there are also options like Selenium, Scrapy and lxml. The re module will provide us with regular expression pattern-mat