Big Data Algorithms and Data Structures
- Instructor:
- Hu Fu
fuhu@mail.shufe.edu.cn
Office: 504 School of Information Management and Engineering
- Lectures:
- Friday 8:55-11:45am, Room 301, Lecture Hall No. 3
- Syllabus:
-
This course is a theoretical introduction to basic data structures and algorithms used to deal with data of large scales.
Randomization plays a crucial role in these techniques, and a large part of this course focuses on randomized algorithms and data structures.
Topics include a review of discrete probability theory, hashing, concentration inequalities, examples of randomized data structures, dimensionality reduction, streaming algorithms, and selected other topics.
- Prerequisites:
-
Familiarity with basic data structures and algorithms will be assumed.
Discrete probability theory will be used throughout the course; a quick review will be provided at the beginning of the course.
- Texts:
-
There is no required textbook.
Supplementary readings are occasionally provided here. Optional readings will be marked as such.
- Course Work:
-
Grades are determined by problem sets/written assignments (30%), a project (30%), and a final exam (40%). The final is a take-home exam.
For the course project, students will form groups of up to 4 people and survey a topic related to data structures or algorithms for big data.
The instructor will suggest candidate topics, but students are encouraged to explore topics of their own interests.
Each group will make an in-class presentation in the last lecture, and submit a survey.
A survey can be in either Chinese or English, with an expected length of two pages.
The projects will be evaluated based on the quality of presentation and the written survey.
- Homework Policies:
-
- We will have 3-4 written assignments. Students are encouraged to form groups of up to three people for each assignment. Each group needs to turn in only one solution, but every group member must be able to explain everything turned in.
- No late solutions will be accepted.
- Typesetting solutions using LaTeX is encouraged.
Here is a LaTeX template for your reference.
-
For assignments and exams, unless stated otherwise, any question that asks you to design an algorithm also requires a justification of (i.e., a proof of) the algorithm's correctness. When a question asks for a certain running time (e.g., polynomial time, or O(n^2)), you should also analyze the running time of your algorithm.
-
Assignments and their solutions will be posted on the Canvas system.
You should turn in your solutions there or hand them in at the start of the lecture. Please make sure that your submission has all the names of your group members.
-
Some problems in the assignments are more challenging than others. You are encouraged to discuss the problems among yourselves or to come to office hours. Start thinking about the problems early rather than waiting until the last day or two; allow yourself time to think and to seek help.
-
If you do work with someone outside your group or use some outside source, you must acknowledge them in your write-up.
- Schedule:
-
-
Sept 12: Introduction and review of probability theory
-
Sept 19: Universal hashing, Bloom filters, Markov's inequality
-
Sept 26: Chebyshev's inequality, Chernoff bounds, Quicksort
-
Oct 10: Finishing Quicksort; balls and bins, skip lists, Johnson-Lindenstrauss transform
-
Oct 17: Streaming model, AMS, k-wise independent hash functions
-
Oct 24: Finishing AMS; Count-min, Count-sketch, distinct elements
-
Oct 31: Distinct elements, Fast JL Transform
-
Nov 7: Sparse Recovery, Compressed Sensing
-
Nov 14: Finish Compressed Sensing, proof of RIP, similarity estimation
-
Nov 21: Project presentations. Nearest neighbor search with JL-transform. Locality sensitive hashing for Hamming distance.