Big Data Analytics
This page contains links to HTML renders of Databricks notebooks that I developed in Summer 2020 for the following two courses offered at Maryville University:
- DSCI 417 - Big Data Analytics (Undergraduate)
- DSCI 617 - Big Data Analytics (Graduate)
These courses cover the use of Apache Spark for distributed data analysis, with an emphasis on using PySpark to develop applications within a Databricks environment.
- 01 - Introduction to Spark
- 02 - Databricks Workspace
- 03 - Databricks Clusters
- 04 - Databricks Notebooks
- 05 - Introduction to PySpark
- 06 - Introduction to RDDs
- 07 - Subsetting and Partitions
- 08 - Map and FlatMap
- 09 - Filter, sortBy, and Reduce
- 10 - Lazy Evaluation and Persistence
- 11 - Pair RDDs
- 12 - Example: Gapminder Dataset
- 13 - Example: Word Count
- 14 - Intro to DataFrames
- 15 - Exploring DataFrames
- 16 - Working with Columns
- 17 - Column Functions
- 18 - Filtering, Sorting, and Grouping
- 19 - Inner and Outer Joins
- 20 - Filtering Joins and Cross Joins
- 21 - Additional Join Topics
- 22 - Example: Warehouse Inventory
- 23 - Spark SQL
- 24 - Introduction to Machine Learning
- 25 - Logistic Regression
- 26 - Multiclass Logistic Regression
- 27 - One-Hot Encoding
- 28 - Pipelines
- 29 - Classification Metrics
- 30 - Overfitting
- 31 - Cross-Validation
- 32 - Regularized Logistic Regression
- 33 - Grid Search for Logistic Regression
- 34 - Decision Trees
- 35 - Grid Search for Decision Trees
- 36 - Random Forests
- 37 - Introduction to Streaming
- 38 - Structured Streaming
- 39 - Windowing
- 40 - Sources and Sinks