Robbie Beane

Assistant Professor of Data Science
Program Director for Computer Science
Maryville University

Big Data Analytics

This page contains links to HTML renders of Databricks notebooks that I developed in Summer 2020 for the following two courses offered at Maryville University:

DSCI 417 - Big Data Analytics (Undergraduate)
DSCI 617 - Big Data Analytics (Graduate)

These courses cover the use of Apache Spark for distributed data analysis, with an emphasis on using PySpark to develop applications within a Databricks environment.

01 - Introduction to Spark
02 - Databricks Workspace
03 - Databricks Clusters
04 - Databricks Notebooks
05 - Introduction to PySpark
06 - Introduction to RDDs
07 - Subsetting and Partitions
08 - Map and FlatMap
09 - Filter, sortBy, and Reduce
10 - Lazy Evaluation and Persistence
11 - Pair RDDs
12 - Example: Gapminder Dataset
13 - Example: Word Count
14 - Intro to DataFrames
15 - Exploring DataFrames
16 - Working with Columns
17 - Column Functions
18 - Filtering, Sorting, and Grouping
19 - Inner and Outer Joins
20 - Filtering Joins and Cross Joins
21 - Additional Join Topics
22 - Example: Warehouse Inventory
23 - Spark SQL
24 - Introduction to Machine Learning
25 - Logistic Regression
26 - Multiclass Logistic Regression
27 - One-Hot Encoding
28 - Pipelines
29 - Classification Metrics
30 - Overfitting
31 - Cross-Validation
32 - Regularized Logistic Regression
33 - Grid Search for Logistic Regression
34 - Decision Trees
35 - Grid Search for Decision Trees
36 - Random Forests
37 - Introduction to Streaming
38 - Structured Streaming
39 - Windowing
40 - Sources and Sinks