Improving the Performance on the UMH Historian (Grafana / TimescaleDB) in 0.14.1

"Loading our dashboards takes ages!"
This is what we heard more and more often from our community members and customers.
We went on an optimization journey with the UMH Historian, specifically focusing on improving query performance within our Grafana and TimescaleDB setup. This article outlines the challenges we faced, the solutions we implemented, and how you can benefit by upgrading to the latest version or applying these improvements yourself.
The Original Problem: Very Long Query Times
Community and customer feedback pointed out severe lag in loading dashboards. Our initial assessments ruled out system performance issues, pinpointing the problem to inefficient SQL queries. An example of such a query from the Management Console's Tag Browser is shown below:
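The original snippet from the console is not reproduced here; as a sketch of the query shape (the argument list of get_asset_id and the concrete tag names are illustrative, not the exact query the console generates):

```sql
-- Fetch the raw values of one tag for one asset over the last day.
-- get_asset_id resolves the human-readable asset hierarchy to the
-- numeric id stored in the asset table.
SELECT timestamp, value
FROM tag
WHERE asset_id = get_asset_id('factoryinsight', 'cologne', 'line-1')
  AND name = 'value'
  AND timestamp > now() - INTERVAL '1 day'
ORDER BY timestamp DESC;
```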
Identifying the Root Cause: The get_asset_id Function
We utilized EXPLAIN ANALYZE on a test system to diagnose the inefficiencies:
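The diagnosis can be reproduced by prefixing the slow query with EXPLAIN ANALYZE (query shape assumed, as above):

```sql
-- EXPLAIN ANALYZE actually runs the query and reports the chosen plan,
-- per-node timings, and the total execution time.
EXPLAIN ANALYZE
SELECT timestamp, value
FROM tag
WHERE asset_id = get_asset_id('factoryinsight', 'cologne', 'line-1')
  AND name = 'value';
```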
An insightful tip led us to discover that our get_asset_id function was not cached, causing significant performance overhead as it was re-evaluated for each database row.
Adjusting Function Volatility of get_asset_id
Our current (0.14.0) function implementation is available here, written using PL/pgSQL:
Aside from the unusual choice of single quotes instead of $$ for the inner function definition, which is dictated by the processing requirements of Helm, nothing initially seemed amiss.
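The linked implementation is the authoritative version; as a rough sketch of its shape (parameter and column names are assumed, and it is written here with standard $$ quoting instead of the Helm-mandated single quotes):

```sql
CREATE OR REPLACE FUNCTION get_asset_id(
    _customer TEXT,
    _location TEXT,
    _asset    TEXT
) RETURNS INTEGER AS $$
DECLARE
    result INTEGER;
BEGIN
    -- Look up the numeric id in the (mostly static, append-only) asset table.
    SELECT id INTO result
    FROM asset
    WHERE customer = _customer
      AND location = _location
      AND assetid  = _asset;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
-- No volatility keyword is given, so PostgreSQL defaults to VOLATILE
-- and re-evaluates the call for every row.
```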
I revisited the excellent PostgreSQL documentation on creating functions and also consulted GPT-4 to see if it detected anything unusual. Both GPT-4 and the documentation confirmed that a function is marked as VOLATILE by default. This means that PostgreSQL will re-evaluate this function for every row it encounters in the tag table. Considering that our table contains hundreds of thousands of entries, this function would perform an SQL query to retrieve the same ID from the asset table every time, even though the asset table is typically static and only grows.
To address this, we introduced new versions of the function:
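A sketch of the two variants, using the same assumed body as above — only the volatility classification changes:

```sql
-- IMMUTABLE: the planner may fold the call into a constant, so the
-- asset lookup runs once per query. Only safe if existing rows in the
-- asset table are never updated or deleted.
CREATE OR REPLACE FUNCTION get_asset_id_immutable(
    _customer TEXT, _location TEXT, _asset TEXT
) RETURNS INTEGER
IMMUTABLE AS $$
DECLARE
    result INTEGER;
BEGIN
    SELECT id INTO result
    FROM asset
    WHERE customer = _customer AND location = _location AND assetid = _asset;
    RETURN result;
END;
$$ LANGUAGE plpgsql;

-- STABLE: evaluated once per scan rather than once per row, and still
-- correct when asset rows may change between statements.
CREATE OR REPLACE FUNCTION get_asset_id_stable(
    _customer TEXT, _location TEXT, _asset TEXT
) RETURNS INTEGER
STABLE AS $$
DECLARE
    result INTEGER;
BEGIN
    SELECT id INTO result
    FROM asset
    WHERE customer = _customer AND location = _location AND assetid = _asset;
    RETURN result;
END;
$$ LANGUAGE plpgsql;
```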
Switching to the IMMUTABLE function version yielded a dramatic improvement:
The execution time was notably reduced: the function is no longer re-executed for each row, and the filter is now simply ((asset_id = 6) AND (name = 'value'::text)) instead of a call to the whole get_asset_id function per row.
This version assumes no modifications to the asset table entries.
If modifications are likely, the STABLE version (get_asset_id_stable) is recommended as it performs similarly by evaluating only once per index scan.
Further Optimization: Continuous Aggregates
While the aforementioned improvements are significant for most use cases, further optimization is possible by allowing the database to "pre-compute" results for specific queries.
In general, this can be achieved using materialized views in PostgreSQL. However, TimescaleDB offers us a more comfortable version suited for frequently changing data. A materialized view creates a new table that holds the results of a previously executed query, while a continuous aggregate automatically updates this table in the background as new data comes in.
The modifications to the get_asset_id function have proven very flexible; now, we need to evaluate what we actually query frequently.
In our case, we observed that for most Grafana dashboards, an average over 5-minute time buckets sufficed. (The bucket size can be adjusted freely, and you can even stack continuous aggregates on top of each other. Note: stacking is not yet available with UMH, as we use TimescaleDB 2.8.)
Space considerations
The initial setup of the materialized view requires an increase in the temp_file_limit (details below). A safe upper limit on a modern HDD or faster storage is roughly the size of the data itself. Once the initial computation is complete, the limit can be reduced again (our current version of TimescaleDB defaults to 1GB).
The materialized view itself consumes only a minimal amount of data, as it saves only one data point per time bucket per asset_id/name combination.
Creating the continuous aggregate
The following query creates a new materialized view without any data. The exact query might vary based on your application requirements. In our case, we generate averages over 5-minute time buckets:
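A sketch of such a continuous aggregate (the view name tag_summary_5m appears later in this article; the column list is assumed):

```sql
CREATE MATERIALIZED VIEW tag_summary_5m
WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', timestamp) AS bucket,
       asset_id,
       name,
       AVG(value) AS avg_value   -- one averaged point per bucket/asset/tag
FROM tag
GROUP BY bucket, asset_id, name
WITH NO DATA;   -- create the view empty; we fill it explicitly below
```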
You can omit the WITH NO DATA option to populate the view with data immediately, but with large datasets we found it easier to manage the fill process separately afterwards.
Filling it with data
For the next step, we initially increased the temp_file_limit as TimescaleDB requires a substantial amount of temporary disk space for the initial creation. For our database, we set this limit to 80GB, having found that lower values (such as 1GB, 2GB, 4GB, 8GB) were insufficient:
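Raising the limit for the current session looks like this (temp_file_limit can only be changed by a superuser):

```sql
-- Allow up to 80GB of temporary files for this session before refreshing.
SET temp_file_limit = '80GB';
```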
Next, we invoke refresh_continuous_aggregate to update the data initially:
Since this is the first refresh, we just set the window_start (2nd argument) to NULL and the window_end (3rd argument) to LOCALTIMESTAMP. This will read all data and add it to our materialized view.
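In SQL terms:

```sql
-- refresh_continuous_aggregate is a procedure, so it is invoked with CALL.
-- NULL as window_start means "from the very beginning of the data";
-- LOCALTIMESTAMP as window_end materializes everything up to now.
CALL refresh_continuous_aggregate('tag_summary_5m', NULL, LOCALTIMESTAMP);
```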
This operation may take a while, ranging from minutes to hours, depending on your dataset size.
If the function fails due to the temp_file_limit, simply increase it and retry.
Ensuring Automatic Data Addition
Lastly, we set up TimescaleDB to automatically refresh the materialized view as new data arrives using add_continuous_aggregate_policy. This requires some fine-tuning based on your specific dataset:
The start_offset is set to consider the oldest data entries we want to refresh, which, although perhaps overly cautious, helps avoid issues with delayed data ingestion. The end_offset determines the newest data to consider, and we've chosen data up to one hour old because it's unlikely to change thereafter. The schedule_interval is aligned with our bucket size to maintain consistency.
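As a sketch (the start_offset value is an assumption; end_offset and schedule_interval follow the reasoning above):

```sql
SELECT add_continuous_aggregate_policy('tag_summary_5m',
    start_offset      => INTERVAL '30 days',   -- oldest buckets to re-examine (assumed value)
    end_offset        => INTERVAL '1 hour',    -- skip data younger than an hour
    schedule_interval => INTERVAL '5 minutes'  -- run as often as the bucket size
);
```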
Executing this command ensures that TimescaleDB keeps your materialized view up-to-date without manual intervention.
Returning to our query plan, we achieve an additional substantial speed improvement:
Permission issues
Should you encounter a db query error: pq: permission denied for materialized view tag_summary_5m when querying from Grafana, grant the necessary permissions:
GRANT SELECT ON TABLE tag_summary_5m TO grafanareader;
Bonus: Real Time aggregates
To handle the need for up-to-the-minute data without frequent re-computations, TimescaleDB supports real-time aggregates, which utilize the materialized view for historical data and compute live aggregates for new data:
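Enabling this on our view is a one-line change:

```sql
-- With materialized_only = false, queries on the view transparently
-- union the pre-computed buckets with a live aggregate over rows that
-- have not been materialized yet.
ALTER MATERIALIZED VIEW tag_summary_5m
SET (timescaledb.materialized_only = false);
```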
Looking at the query plan:
The difference in execution time is only slight.
Summary
With the implemented changes, query performance improved drastically (see also comparison below). Users can experience significantly faster dashboard loading times by upgrading to version 0.14.1 of our UMH Historian or by applying the discussed optimizations to their existing setups.
Benchmark & Comparison
Check out the comparison between four different queries in the video below.
get_asset_id_immutable - CA 5m is a 5-minute time-bucket continuous aggregate with our fast asset-ID lookup
get_asset_id_immutable - CA 3h is a 3-hour time-bucket continuous aggregate with our fast asset-ID lookup
get_asset_id_immutable is just the optimized function, without any continuous aggregate
get_asset_id is the old function
The table contains 1,288,090 datapoints for our asset and 7,769,229 datapoints for all assets combined.
Comparison of different methods
¹: Always includes the latest datapoint, even if aggregated (real-time continuous aggregate)
²: A continuous aggregate (real-time or not) only stores one pre-computed query; if you want to show, for example, the max instead of the avg, you need to create another continuous aggregate
Benchmark results
Athena is our local development server & Dionysus is a Hetzner VM (CPX41) we use for testing.
All benchmarks were generated using fio-docker.


