Hi Mike,
This is an interesting question, and I have created a sample application demonstrating two approaches.
The app is here: GitHub - cuba-labs/large-dataset: How to handle large data sets
It works with a local PostgreSQL database, so if you want to run it you should have PostgreSQL installed.
It has two entities, Order and Customer, and a service that calculates the volume of sales by month.
If you open the Orders browser, you can see three buttons:
- “Generate data”: creates 10,000 customers and 500,000 orders for them in the database.
- “Process method 1” and “Process method 2”: calculate volumes using the first and the second approaches (see below).
First approach
First, we load a list of identifiers of all objects we are going to process - all orders in this case. This can be done in a single transaction and kept in a list in memory, because identifiers are small and the total size of the loaded data, even for a large dataset (500K records), easily fits into memory.
private List<UUID> loadIdList() {
    List<UUID> idList;
    // read-only transaction: we only need the identifiers
    try (Transaction tx = persistence.createTransaction(new TransactionParams().setReadOnly(true))) {
        TypedQuery<UUID> query = persistence.getEntityManager().createQuery(
                "select o.id from sample$Order o", UUID.class);
        idList = query.getResultList();
        tx.commit();
    }
    return idList;
}
Then we process the objects in batches, splitting the list of IDs into chunks and loading each batch of objects in a separate transaction.
public Map<String, BigDecimal> process1() {
    log.info("Processing method 1");
    List<UUID> idList = loadIdList();
    HashMap<String, BigDecimal> result = new HashMap<>();
    AtomicInteger counter = new AtomicInteger();
    idList.stream()
            // split the list of IDs into chunks of 100
            .collect(Collectors.groupingBy(o -> counter.getAndIncrement() / 100))
            .forEach((chunk, chunkIds) -> {
                // load and process each chunk in its own read-only transaction
                try (Transaction tx = persistence.createTransaction(new TransactionParams().setReadOnly(true))) {
                    TypedQuery<Order> query = persistence.getEntityManager().createQuery(
                            "select o from sample$Order o where o.id in ?1", Order.class);
                    query.setParameter(1, chunkIds);
                    List<Order> orders = query.getResultList();
                    for (Order order : orders) {
                        processOrder(order.getDate(), order.getAmount(), result);
                    }
                    tx.commit();
                }
                log.info("Processed orders: " + chunk * 100);
            });
    return result;
}
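For reference, processOrder() just aggregates the order amount into a per-month bucket of the result map. The actual implementation is in the repo; here is only a minimal sketch of the idea (the month key format is just an example, not the real one):
private void processOrder(Date date, BigDecimal amount, Map<String, BigDecimal> result) {
    // sketch: build a month key like "2017-05" and add the amount to that bucket
    String month = new SimpleDateFormat("yyyy-MM").format(date);
    result.merge(month, amount, BigDecimal::add);
}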
On my laptop, with default settings for the Tomcat JVM and the PostgreSQL database, processing 500,000 records takes 10-12 seconds.
Second approach
Here I use EclipseLink’s CursoredStream in a single transaction.
public Map<String, BigDecimal> process2() {
    log.info("Processing method 2");
    HashMap<String, BigDecimal> result = new HashMap<>();
    try (Transaction tx = persistence.createTransaction(new TransactionParams().setReadOnly(true))) {
        // get the underlying EclipseLink UnitOfWork to run a native EclipseLink query
        UnitOfWork unitOfWork = persistence.getEntityManager().getDelegate().unwrap(UnitOfWork.class);
        ReadAllQuery query = new ReadAllQuery(Order.class);
        query.useCursoredStream();
        CursoredStream stream = (CursoredStream) unitOfWork.executeQuery(query);
        int i = 0;
        while (!stream.atEnd()) {
            Order order = (Order) stream.read();
            processOrder(order.getDate(), order.getAmount(), result);
            // release already processed objects from the stream
            stream.releasePrevious();
            i++;
            if (i % 100 == 0) {
                log.info("Processed orders: " + i);
            }
        }
        stream.close();
        tx.commit();
    }
    return result;
}
Unfortunately, the processing stops at about 200,000 records with a “GC overhead limit exceeded” error, which indicates that too much data is loaded and not released. Perhaps I’m doing something wrong; it would be great if someone could give a hint on how to use it correctly.
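One thing I plan to try (untested, just an assumption that the UnitOfWork keeps clones of all loaded objects) is telling the query not to register results in the cache and using a smaller cursor page size, something like:
// assumption: disabling caching and shrinking the cursor page size may reduce retained memory;
// this is only an idea to experiment with, not a verified fix
ReadAllQuery query = new ReadAllQuery(Order.class);
query.dontMaintainCache();          // do not register loaded objects in the cache
query.useCursoredStream(100, 100);  // page size and initial read size
If anyone knows whether this is the right knob, or has another way to keep the memory usage flat, please share.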