
Developer Tips: Working with Datasets - Reduce, Relocate, Reuse

Years ago, I was working on fixing a dashboard for a company. This dashboard displayed several tables of data and a few charts. It loaded in around four minutes.


You read that right.


Four. Minutes.


For a dashboard. Nothing on the screen for four minutes. That's far from a desirable user experience.


As I dug into the code, I saw several problems.


First was the SIZE of the datasets being manipulated. Many of the datasets were between 10K and 40K records. These were then massaged using LINQ to produce the final datasets that powered the graphs and charts.


The second was WHERE the datasets were being manipulated. All of the work happened inside the main method that rendered the entire dashboard. Everything ran synchronously and serially, blocking the UI thread until it all completed. All necessary data was pulled from the database, up through the API, and then massaged in the front-end code based on business rules.


The final issue was the REPEATED retrieval of the same data. In cases where a dataset was used more than once, it was retrieved from the database multiple times.


The front-end code looked something like this:

class Dashboard{
	var api = new API(...);

	LoadDashboard(){
		// Everything below runs synchronously on the UI thread.
		var dsA = api.GetDatasetA();
		var dsB = api.GetDatasetB();
		[processing]
		this.ChartA.data = processedDataA;

		dsA = api.GetDatasetA();   // same dataset, retrieved from the database again
		var dsC = api.GetDatasetC();
		[processing]
		this.ChartB.data = processedDataB;
	}
}

Note: all code is pseudo-code and is meant for illustrative purposes only.


Reduce Dataset Size

Understanding the size of the datasets your code will be manipulating is important. If your code runs at O(n^2) and your dataset has 10 elements, that's only about 100 operations and not a big deal (but you should still fix it). If your dataset has 10 million elements, it's a very different story: 10 million squared is 10^14 operations, which is more than a day of compute even at a billion operations per second. If you don't know how big the datasets you'll be working with are, assume the worst and optimize your code appropriately.


No matter how efficient your algorithm is, if you're able to run it with less data, you're going to be able to deliver results to your users faster.

Look at what you can do to reduce the size of that dataset.

Can you add filtering to the query that retrieves the data to reduce the size?

Can you filter the dataset after retrieval to shrink it?

Can you partition the dataset and work with only a portion of it at a time?


In this case, I was able to alter the queries and provide some criteria to shrink some of the datasets from ~40K rows to ~10K rows. In other cases, I was able to bring the datasets down to hundreds of rows. This was already a major improvement.
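
As a rough sketch of the first option, here's what pushing a filter into a LINQ query can look like. The Order type and the 90-day cutoff are made up for illustration; they're not the actual schema or criteria from this project:

using System;
using System.Collections.Generic;
using System.Linq;

public record Order(int Id, DateTime CreatedAt, decimal Total);

public static class DatasetReduction
{
	// Keep only the last 90 days of orders. Against an in-memory list this
	// filters after retrieval; against an EF Core IQueryable, the same Where
	// clause is translated to SQL, so the filter runs in the database and
	// only the reduced set crosses the wire.
	public static List<Order> RecentOrders(IEnumerable<Order> orders)
	{
		var cutoff = DateTime.UtcNow.AddDays(-90);
		return orders
			.Where(o => o.CreatedAt >= cutoff)  // shrink the working set first
			.ToList();
	}
}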


Relocate Processing

My next step was to look at where the processing was occurring. These datasets were being mashed together along with some business rules to generate the data that went into the tables on the dashboard. My first instinct was to move all of this into stored procedures, but architectural constraints ruled that out.


When I was handed the problem, all of the processing was being done in the front end of an ASP.NET application, and it blocked everything until it was complete. That seemed sub-optimal. I moved the generation of the source datasets to the API layer and made each call asynchronous. The work to massage two datasets into a new one moved to asynchronous methods within the front-end application. Now each (smaller) dataset could be loaded individually without blocking the rest of the UI. Additionally, other places in the system that needed those datasets could use the new API methods.


With each chart or graph on the dashboard being delivered by an asynchronous method, the dashboard could start displaying data much sooner.
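
Here's a minimal sketch of that shape. The type and method names (IDashboardApi, GetDatasetAAsync, and so on) are stand-ins, not the project's actual API:

using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical API surface; the real methods and types differed.
public interface IDashboardApi
{
	Task<List<Row>> GetDatasetAAsync();
	Task<List<Row>> GetDatasetBAsync();
}

public record Row(int Id, decimal Value);

public class DashboardLoader
{
	private readonly IDashboardApi _api;
	public DashboardLoader(IDashboardApi api) => _api = api;

	public async Task LoadAsync()
	{
		// Start both requests immediately so they run concurrently,
		// rather than one blocking the other.
		var dsATask = _api.GetDatasetAAsync();
		var dsBTask = _api.GetDatasetBAsync();

		await Task.WhenAll(dsATask, dsBTask);

		// The UI thread is never blocked while the data is in flight.
		RenderChartA(await dsATask, await dsBTask);
	}

	private void RenderChartA(List<Row> a, List<Row> b)
	{
		// [chart-specific merging and rendering goes here]
	}
}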


Reuse Data

I noticed the repeated retrieval of the same datasets. Dataset A would be merged with dataset B to produce chart X. Dataset A would be merged with dataset C to produce chart Y. And for each chart, dataset A was retrieved from the database. This caused a few issues: in a couple of places, the code held multiple copies of the same dataset in memory, and we paid the cost of retrieval multiple times. Neither is efficient.


To solve this, I started with a micro-caching strategy. Each API call to retrieve a dataset would hold the result in a cache for between one and five minutes, so if the front end needed a dataset again, the call to retrieve it was very quick. This helped not only the dashboard but also other calls throughout the system.
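
A minimal sketch of such a micro-cache, using MemoryCache from System.Runtime.Caching (the key name and the two-minute TTL in the usage example are illustrative assumptions):

using System;
using System.Runtime.Caching;

public class DatasetCache
{
	private static readonly MemoryCache Cache = MemoryCache.Default;

	// Return the cached dataset if it's still fresh; otherwise fetch
	// it and hold it for the given time-to-live.
	public T GetOrFetch<T>(string key, Func<T> fetch, TimeSpan ttl)
	{
		if (Cache.Get(key) is T cached)
			return cached;

		var value = fetch();
		Cache.Set(key, value, DateTimeOffset.Now.Add(ttl));
		return value;
	}
}

// Usage: repeat requests within two minutes skip the database entirely.
// var dsA = cache.GetOrFetch("DatasetA", () => api.GetDatasetA(), TimeSpan.FromMinutes(2));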


Second, I refactored the code so that each dataset for the dashboard was loaded only once and passed into the methods that used it. Yes, you're right. That's called Dependency Injection. If I needed dataset A in two methods, I simply passed it in. Easy as pie!


Conclusion

By reducing the size of the datasets the code was working with, reusing datasets instead of retrieving them from the database repeatedly, and moving their processing to asynchronous methods, I was able to cut the dashboard's complete load time to under 15 seconds. Additionally, because items were being loaded asynchronously, users were presented with information much more rapidly than before.


The new code looked more like this:

class Dashboard{
	public async Task LoadDashboard(){
		// One round trip gathers the (smaller, filtered) datasets.
		var datasetResults = await GetDatasetsAsync();

		// Each chart loads independently; dataset A is fetched once
		// and passed in wherever it's needed.
		var loadChartATask = LoadChartAAsync(datasetResults.DatasetA, datasetResults.DatasetB);
		var loadChartBTask = LoadChartBAsync(datasetResults.DatasetA, datasetResults.DatasetC);
...
		await Task.WhenAll(loadChartATask, loadChartBTask, ...);
	}

	async Task LoadChartAAsync(ds1, ds2){
		[processing]
		this.ChartA.data = processedDataA;
	}

	async Task LoadChartBAsync(ds1, ds2){
		[processing]
		this.ChartB.data = processedDataB;
	}
}

Takeaway

As a software developer, when you're building systems or just working within part of a system, consider the following:

  1. How can you Reduce the size of the datasets you're working with?

    1. Can you add filters at the database level?

    2. Can you filter it quickly after retrieving it?

    3. Can you fragment it and work with it in pieces?

  2. Can you Relocate the processing of the data to improve performance?

    1. Can you make your processing methods async?

    2. Can you move the processing to a new thread?

  3. Can you Reuse a dataset you already have in memory?

    1. Can you cache the data?

    2. Can you use Dependency Injection to use the data multiple times?

