Project: Data Analysis with Streams
In this capstone lesson you will apply everything from the tutorial — filtering, mapping, collecting, reducing, flatMapping, sorting, and working with Optional — to a single, realistic scenario: analysing a dataset of employee records. By the end you will have one cohesive program that asks ten real business questions and answers each one with a focused stream pipeline.
The Dataset
We start with a simple record to represent each employee. A record gives us final fields plus an auto-generated constructor, accessor methods, equals, hashCode, and toString for free.
import java.util.List;

public record Employee(
String name,
String department,
double salary,
int yearsOfExperience,
List<String> skills
) {}
Then we build a list that we will query throughout the project:
import java.util.*;
import java.util.stream.*;
List<Employee> employees = List.of(
new Employee("Alice", "Engineering", 95_000, 7, List.of("Java", "Kotlin", "SQL")),
new Employee("Bob", "Engineering", 82_000, 3, List.of("Java", "Python")),
new Employee("Carol", "Marketing", 68_000, 5, List.of("SEO", "Analytics")),
new Employee("David", "Engineering", 110_000, 12, List.of("Java", "Scala", "Spark")),
new Employee("Eve", "HR", 60_000, 2, List.of("Communication", "Excel")),
new Employee("Frank", "Marketing", 73_000, 6, List.of("SEO", "PPC", "Analytics")),
new Employee("Grace", "HR", 67_000, 8, List.of("Recruiting", "Excel")),
new Employee("Henry", "Engineering", 91_000, 5, List.of("Python", "Docker", "SQL")),
new Employee("Irene", "Marketing", 78_000, 9, List.of("Analytics", "Branding")),
new Employee("James", "Engineering", 99_000, 10, List.of("Java", "Kubernetes", "SQL"))
);
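Why use a record? Records (previewed in Java 14 and finalized in Java 16) are a good fit for plain data carriers like rows in a dataset. Their fields are final, they eliminate boilerplate, and they signal to readers that the class is purely a data holder with no hidden behaviour. The immutability is shallow, though: a record does not copy a mutable list passed to its constructor. The List.of lists used here are already unmodifiable, so this dataset is safe.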
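To illustrate what record immutability covers, here is a minimal sketch of a record that copies its list defensively in a compact constructor. The class name RecordCopyDemo and the helper isIsolatedFromCaller are hypothetical names invented for this example, and the lesson's own Employee record does not need the copy because its lists come from List.of.

```java
import java.util.ArrayList;
import java.util.List;

class RecordCopyDemo {
    // Hypothetical variant of the lesson's Employee record (not the original definition).
    record Employee(String name, String department, double salary,
                    int yearsOfExperience, List<String> skills) {
        // Compact constructor: runs before the fields are assigned.
        Employee {
            // List.copyOf returns an unmodifiable copy, so later changes to the
            // caller's list cannot leak into the record.
            skills = List.copyOf(skills);
        }
    }

    // Returns true if mutating the caller's list leaves the record unchanged.
    static boolean isIsolatedFromCaller() {
        List<String> mutable = new ArrayList<>(List.of("Java", "SQL"));
        Employee e = new Employee("Alice", "Engineering", 95_000, 7, mutable);
        mutable.add("Kotlin"); // changes only the caller's list
        return e.skills().equals(List.of("Java", "SQL"));
    }

    public static void main(String[] args) {
        System.out.println("Record isolated from caller's list: " + isIsolatedFromCaller());
    }
}
```

Without the compact constructor, the added "Kotlin" entry would be visible through e.skills(), because the record would hold a reference to the same ArrayList.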
Question 1 — How many employees are in Engineering?
long engineeringCount = employees.stream()
.filter(e -> e.department().equals("Engineering"))
.count();
System.out.println("Engineering headcount: " + engineeringCount); // 5
Question 2 — What is the average salary across the whole company?
OptionalDouble avgSalary = employees.stream()
.mapToDouble(Employee::salary)
.average();
avgSalary.ifPresent(avg ->
System.out.printf("Company average salary: $%.2f%n", avg));
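The reason average() returns an OptionalDouble rather than a plain double is that an empty stream has no average. The sketch below shows both cases on a small list of salaries. The class EmptyAverageDemo and its average helper are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.OptionalDouble;

class EmptyAverageDemo {
    // Averages a list of salaries; the result is empty when the list is empty.
    static OptionalDouble average(List<Double> salaries) {
        return salaries.stream()
            .mapToDouble(Double::doubleValue)
            .average();
    }

    public static void main(String[] args) {
        OptionalDouble none = average(List.of());
        OptionalDouble some = average(List.of(60_000.0, 67_000.0));

        System.out.println(none.isPresent()); // false: nothing to average
        System.out.println(some.orElse(0.0)); // 63500.0
    }
}
```

Using ifPresent, as the lesson does, or orElse with an explicit fallback keeps the "no data" case visible instead of silently reporting 0 or NaN.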
Question 3 — Who is the highest-paid employee?
Optional<Employee> topEarner = employees.stream()
.max(Comparator.comparingDouble(Employee::salary));
topEarner.ifPresent(e ->
System.out.println("Top earner: " + e.name() + " ($" + e.salary() + ")"));
Question 4 — List all unique skills used in Engineering
flatMap is the right tool here: each employee has a list of skills, so we need to flatten many lists into one stream before deduplicating.
List<String> engineeringSkills = employees.stream()
.filter(e -> e.department().equals("Engineering"))
.flatMap(e -> e.skills().stream())
.distinct()
.sorted()
.collect(Collectors.toList());
System.out.println("Engineering skills: " + engineeringSkills);
// [Docker, Java, Kotlin, Kubernetes, Python, SQL, Scala, Spark]
Question 5 — Average salary per department
Collectors.groupingBy combined with a downstream averagingDouble collector answers this in one pass:
Map<String, Double> avgByDept = employees.stream()
.collect(Collectors.groupingBy(
Employee::department,
Collectors.averagingDouble(Employee::salary)
));
avgByDept.forEach((dept, avg) ->
System.out.printf("%-15s avg salary: $%.2f%n", dept, avg));
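One detail worth knowing: the two-argument groupingBy makes no promise about the type or iteration order of the map it returns, so the departments may print in any order. If a stable, alphabetical report matters, one option is the three-argument overload with a map factory. This is a sketch on a reduced, hypothetical Employee record (name, department, salary only), and SortedGroupingDemo is an invented class name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class SortedGroupingDemo {
    // Reduced record for illustration; the lesson's record has more fields.
    record Employee(String name, String department, double salary) {}

    static final List<Employee> SAMPLE = List.of(
        new Employee("Carol", "Marketing", 68_000),
        new Employee("Eve", "HR", 60_000),
        new Employee("Frank", "Marketing", 73_000),
        new Employee("Grace", "HR", 67_000),
        new Employee("Irene", "Marketing", 78_000));

    // TreeMap::new asks the collector to build a map sorted by department name.
    static Map<String, Double> averageByDepartment(List<Employee> employees) {
        return employees.stream()
            .collect(Collectors.groupingBy(
                Employee::department,
                TreeMap::new,
                Collectors.averagingDouble(Employee::salary)));
    }

    public static void main(String[] args) {
        // Iterates in key order: HR first, then Marketing.
        averageByDepartment(SAMPLE).forEach((dept, avg) ->
            System.out.printf("%-15s avg salary: $%.2f%n", dept, avg));
    }
}
```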
Question 6 — Names of employees earning above $90,000, sorted alphabetically
List<String> highEarnerNames = employees.stream()
.filter(e -> e.salary() > 90_000)
.map(Employee::name)
.sorted()
.collect(Collectors.toList());
System.out.println("Earning > $90k: " + highEarnerNames);
// [Alice, David, Henry, James]
Chain filter before map. Filtering first means the mapping function only runs on the elements that survive the filter. The Streams API does not re-order intermediate operations for you: they execute in the order you write them. In this pipeline the order is also required, because the salary is no longer available once each employee has been mapped to a name.
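A small sketch can make the cost difference concrete by counting how often the mapping lambda runs in each ordering. It uses the salaries from the dataset as plain integers. The class FilterBeforeMapDemo, its method names, and the "convert to thousands" mapping step are all assumptions made up for this illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class FilterBeforeMapDemo {
    // Salaries from the lesson's dataset, as whole dollars.
    static final List<Integer> SALARIES = List.of(
        95_000, 82_000, 68_000, 110_000, 60_000,
        73_000, 67_000, 91_000, 78_000, 99_000);

    // filter -> map: the mapping lambda only sees salaries above 90,000.
    static int mapCallsWhenFilterFirst() {
        AtomicInteger calls = new AtomicInteger();
        SALARIES.stream()
            .filter(s -> s > 90_000)
            .map(s -> { calls.incrementAndGet(); return s / 1_000; })
            .collect(Collectors.toList());
        return calls.get();
    }

    // map -> filter: the mapping lambda runs for every element.
    static int mapCallsWhenMapFirst() {
        AtomicInteger calls = new AtomicInteger();
        SALARIES.stream()
            .map(s -> { calls.incrementAndGet(); return s / 1_000; })
            .filter(k -> k > 90)
            .collect(Collectors.toList());
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("filter first: " + mapCallsWhenFilterFirst() + " map calls"); // 4
        System.out.println("map first:    " + mapCallsWhenMapFirst() + " map calls");    // 10
    }
}
```

Both orderings produce the same four values here; only the amount of mapping work differs.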
Question 7 — Total salary budget per department
Map<String, Double> budgetByDept = employees.stream()
.collect(Collectors.groupingBy(
Employee::department,
Collectors.summingDouble(Employee::salary)
));
budgetByDept.forEach((dept, total) ->
System.out.printf("%-15s total budget: $%.0f%n", dept, total));
Question 8 — The most experienced employee in each department
Collectors.toMap with a merge function picks the winner when two employees map to the same key:
Map<String, Employee> mostExperienced = employees.stream()
.collect(Collectors.toMap(
Employee::department,
e -> e,
(a, b) -> a.yearsOfExperience() >= b.yearsOfExperience() ? a : b
));
mostExperienced.forEach((dept, e) ->
System.out.println(dept + " → " + e.name() + " (" + e.yearsOfExperience() + " yrs)"));
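The hand-written merge lambda above can also be expressed with BinaryOperator.maxBy, which builds the "keep the larger one" function from a comparator. The sketch below is an alternative phrasing, not a replacement for the lesson's code. It uses a reduced, hypothetical Employee record and an invented class name, MostExperiencedDemo.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

class MostExperiencedDemo {
    // Reduced record for illustration; the lesson's record has more fields.
    record Employee(String name, String department, int yearsOfExperience) {}

    static final List<Employee> SAMPLE = List.of(
        new Employee("Carol", "Marketing", 5),
        new Employee("Eve", "HR", 2),
        new Employee("Grace", "HR", 8),
        new Employee("Irene", "Marketing", 9));

    static Map<String, Employee> mostExperiencedByDepartment(List<Employee> employees) {
        Comparator<Employee> byExperience =
            Comparator.comparingInt(Employee::yearsOfExperience);
        return employees.stream()
            .collect(Collectors.toMap(
                Employee::department,                // key: department name
                e -> e,                              // value: the employee itself
                BinaryOperator.maxBy(byExperience))); // on a key clash, keep the more experienced
    }

    public static void main(String[] args) {
        mostExperiencedByDepartment(SAMPLE).forEach((dept, e) ->
            System.out.println(dept + " -> " + e.name() + " (" + e.yearsOfExperience() + " yrs)"));
    }
}
```

Naming the comparator separately keeps the ranking rule in one place, which helps if the same ordering is needed elsewhere in the program.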
Question 9 — Do any employees know both Java and SQL?
Use anyMatch for a short-circuiting existence check — it stops as soon as a match is found:
boolean javaAndSql = employees.stream()
.anyMatch(e -> e.skills().containsAll(List.of("Java", "SQL")));
System.out.println("Someone knows Java & SQL: " + javaAndSql); // true
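To see the short-circuiting in action, the following sketch counts how many times the predicate is evaluated on a sequential stream. The skill lists are a hypothetical four-person sample arranged so that the second entry is the first match, and AnyMatchShortCircuitDemo is an invented class name.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class AnyMatchShortCircuitDemo {
    // Hypothetical sample: the second list is the first to contain both Java and SQL.
    static final List<List<String>> SKILLS = List.of(
        List.of("Java", "Python"),
        List.of("Java", "Kotlin", "SQL"),
        List.of("SEO", "Analytics"),
        List.of("Java", "Kubernetes", "SQL"));

    // Returns how many skill lists were inspected before anyMatch returned true,
    // or -1 if no list matched.
    static int predicateCalls() {
        AtomicInteger calls = new AtomicInteger();
        boolean found = SKILLS.stream()
            .anyMatch(skills -> {
                calls.incrementAndGet();
                return skills.containsAll(List.of("Java", "SQL"));
            });
        return found ? calls.get() : -1;
    }

    public static void main(String[] args) {
        // Only the first two lists are inspected; the remaining two are never visited.
        System.out.println("Predicate evaluated " + predicateCalls() + " times"); // 2
    }
}
```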
Question 10 — Summary statistics for Engineering salaries
DoubleSummaryStatistics captures count, sum, min, max, and average in a single terminal operation:
DoubleSummaryStatistics stats = employees.stream()
.filter(e -> e.department().equals("Engineering"))
.mapToDouble(Employee::salary)
.summaryStatistics();
System.out.println("Engineering salary stats:");
System.out.println(" Count : " + stats.getCount());
System.out.printf (" Min : $%.0f%n", stats.getMin());
System.out.printf (" Max : $%.0f%n", stats.getMax());
System.out.printf (" Avg : $%.2f%n", stats.getAverage());
System.out.printf (" Total : $%.0f%n", stats.getSum());
Putting It All Together — What You Practised
- filter + count — headcount by department (Q1).
- mapToDouble + average — numeric aggregation with OptionalDouble (Q2).
- max with Comparator — finding a single winner via Optional (Q3).
- flatMap + distinct + sorted — flattening nested collections (Q4).
- groupingBy + averagingDouble / summingDouble — multi-group aggregation (Q5, Q7).
- filter + map + sorted + collect — the classic pipeline (Q6).
- toMap with merge function — keyed aggregation with conflict resolution (Q8).
- anyMatch — short-circuit existence check (Q9).
- summaryStatistics — bulk numeric stats in one pass (Q10).
Streams are not a silver bullet. For very small lists a plain for loop is often simpler and at least as fast. Choose streams when the declarative style makes the intent clearer, which is usually the case for filtering, grouping, and aggregating datasets like this one.
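For comparison, here is the Question 1 headcount written both ways. This is a sketch on a reduced, hypothetical Employee record (name and department only), with invented method names. Both versions return the same count, so the choice between them is about readability.

```java
import java.util.List;

class LoopVersusStreamDemo {
    // Reduced record for illustration; the lesson's record has more fields.
    record Employee(String name, String department) {}

    static final List<Employee> SAMPLE = List.of(
        new Employee("Alice", "Engineering"),
        new Employee("Carol", "Marketing"),
        new Employee("David", "Engineering"),
        new Employee("Eve", "HR"));

    // Imperative version: explicit loop, condition, and counter.
    static long countWithLoop(List<Employee> employees, String department) {
        long count = 0;
        for (Employee e : employees) {
            if (e.department().equals(department)) {
                count++;
            }
        }
        return count;
    }

    // Declarative version: states what is counted, not how to iterate.
    static long countWithStream(List<Employee> employees, String department) {
        return employees.stream()
            .filter(e -> e.department().equals(department))
            .count();
    }

    public static void main(String[] args) {
        System.out.println(countWithLoop(SAMPLE, "Engineering"));   // 2
        System.out.println(countWithStream(SAMPLE, "Engineering")); // 2
    }
}
```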
Summary
You have now built a complete data-analysis program using nothing but the Streams API. The key insight is that every business question maps naturally to a pipeline: filter down to the relevant rows, map or flatMap to the values you care about, then collect or reduce to the final answer. Master that mental model and you can query any in-memory dataset fluently in Java.