Monitoring Series: Database Size Report

ServiceNow instance performance is a constant concern for any ServiceNow admin. To get ahead of the curve, it is a good idea to keep an eye on your instance database size.

Database size is the sum of all the table and table index data stored on the instance database server. It’s easy to ask ServiceNow directly or fill out their self-service form on the Support page, but even easier to just make a report and put it on a dashboard yourself!

Step by Step

  1. First go to Reports > Create New

  2. Set a name like “Current Database Size (MB)”, and select the table UA Instance Info [ua_instance_state].

  3. Select the type Speedometer (you can also use a Single Score, but it’s nice to set ranges!)

  4. On the Configure tab, select Sum as the Aggregation and “Primary database size (MB)” as the field.

  5. On the Style tab, set the direction to Minimize, and set reasonable ranges you think your database size should fall under.

  6. Finally, you can go and add it to a dashboard.

The finished result should look something like this:

[Screenshot: db_size_detail.PNG]

This report is an excellent addition to any system administrator’s dashboard.
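If you would rather pull the number from a background script than a report, here is a minimal GlideAggregate sketch. It assumes the same UA Instance Info [ua_instance_state] table; the column name primary_db_size_mb is inferred from the “Primary database size (MB)” field label, so verify the actual element name in sys_dictionary first.

    // Sum the primary database size across ua_instance_state rows,
    // mirroring the report's Sum aggregation.
    var ga = new GlideAggregate('ua_instance_state');
    ga.addAggregate('SUM', 'primary_db_size_mb'); // column name inferred from the field label
    ga.query();
    if (ga.next()) {
        gs.log('Current database size (MB): ' + ga.getAggregate('SUM', 'primary_db_size_mb'));
    }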

Faster API than GlideRecord?

What if I told you there was an even more optimized way to get data from a ServiceNow table on the server side besides GlideRecord?


This idea stemmed from my research on how the Table REST API lets you select only certain fields from a table, and how that significantly optimizes the query.

There is a similar trick using GlideAggregate: just add each field individually with groupBy(), like you would for counting. Props to Tony DiRienzo for posting this on the Community.

    var ga = new GlideAggregate('sys_user');
    ga.groupBy('user_name');
    ga.groupBy('sys_id'); // This makes sure we get all records, even if there are duplicates
    ga.query();
    while (ga.next()) {
        // Do what you need to with the user_name field
        var userName = ga.getValue('user_name');
    }

In many cases this type of query runs up to 5 times faster, and it works best when you only need a few fields. To expand it, all you need to do is add another groupBy() for each field you need included.
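For example, a sketch that pulls two more sys_user fields alongside user_name (note that department is a reference field, so getValue() returns its sys_id):

    var ga = new GlideAggregate('sys_user');
    ga.groupBy('user_name');
    ga.groupBy('email');
    ga.groupBy('department'); // reference field, so getValue() returns a sys_id
    ga.groupBy('sys_id');     // still included so duplicate field values don't collapse rows
    ga.query();
    while (ga.next()) {
        gs.log(ga.getValue('user_name') + ' - ' + ga.getValue('email'));
    }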

The only caveat here is that you lose the performance gain if you dot-walk in the results.

This is just another great tool to add to a developer’s toolkit for building ServiceNow applications.

Why Deleting User Records is Slow

If you’ve ever spent any time as a ServiceNow system administrator, you’ve probably had to delete a User record, whether it was demo data, a record created by mistake, or part of a general data cleanup.

The Problem

Deleting data on the sys_user table takes a really long time, to the tune of minutes per user.

The Why

This is because of how many table columns reference the user table; with cascade delete rules, every one of those columns and its related records must be queried for each user deleted.

The Fix (or closest to it…)

The easiest first step is to use the setWorkflow(false) function when deleting records. This stops any notifications or business rules from firing.
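For example, a minimal background-script sketch (the demo.user account is a placeholder):

    var user = new GlideRecord('sys_user');
    if (user.get('user_name', 'demo.user')) { // placeholder account
        user.setWorkflow(false); // suppress business rules and notifications for this delete
        user.deleteRecord();
    }

Note that setWorkflow(false) does not skip the cascade delete processing itself, which is why the remaining steps below still matter.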

Next, delete a single user using a background script and look at the detailed output. If there are messages like:

Could not find field u_userref in table u_example during cascade delete

Locate the sys_dictionary record for that column and delete it. Sometimes the column itself was deleted but its dictionary record remains, leaving an orphaned reference to the user table.

Next, look at all sys_dictionary user references and review the Cascade Delete rule settings for those columns. Consider changing some of them to None if maintaining the reference is not important. More information on Cascade Delete rules can be found here: https://docs.servicenow.com/bundle/paris-platform-administration/page/administer/field-administration/task/t_CascadeDeleteRules.html
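To speed up that review, a background-script sketch like this can list every reference column pointing at the user table (name and element are standard sys_dictionary columns; check each entry’s Cascade Delete rule from its dictionary form):

    var dict = new GlideRecord('sys_dictionary');
    dict.addQuery('internal_type', 'reference');
    dict.addQuery('reference', 'sys_user');
    dict.query();
    gs.log('Columns referencing sys_user: ' + dict.getRowCount());
    while (dict.next()) {
        // Prints table.column for each dictionary entry to review
        gs.log(dict.getValue('name') + '.' + dict.getValue('element'));
    }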

Once all those fixes have been applied, you should see a drastic improvement in the speed of the deletes, potentially even cutting the time in half!

Extra credit: ServiceNow provides an SQL Debugger tool which can help you identify where the slowest parts of the DB operation are, including Cascade Delete operations.

Best Practices: MID Server Administration

Below are some of my personal thoughts and advice on MID Server configuration, setup, and maintenance. Some of these are well known and widely adopted, and some are based on my personal experiences.

  1. Basics

    a. Sizing / Server Specs

    Specification-wise, RAM and CPU core count (and the associated threads) are the most important. Usually 8 GB RAM and a 4-core CPU are recommended, with 40+ GB of disk and a 64-bit OS (a requirement being enforced going forward). Within the MID Server parameters there are options to use more RAM, threads, etc. Allocating more RAM is generally recommended; with more threads, results can vary.

    b. Network (Bandwidth and Location)

    This point is the key reason MID Servers are needed in the first place: they are the appliance that lets you securely interact with other resources and devices on a network. It’s important to have a fast connection speed (usually 100 Mb/s upload is recommended) and to put the MID Server physically close to the devices it interacts with, for quick and optimized interactions. If you have multiple data centers, spread the servers across those data centers and place them on the same subnets as the majority of their target devices.

  2. Segregate by Purpose

    a. History

    The biggest flaw I’ve seen with enterprise MID Server setups is that there is no separation between Discovery and Orchestration usage. Many Discovery credentials have server admin or similarly powerful rights to log into machines and run discovery commands. There are a couple of ways to force Orchestration tasks to use specific credentials: credential tags, and hardcoding MID Server IP relationships with the target Orchestration device. However, you can’t lock down the Discovery credentials to only be used for Discovery, which can be a major security issue.

    b. Resolution

    The only guaranteed mechanism for locking down those credentials is to restrict them to specific MID Servers. Additionally, on those MID Servers you have the option to disallow the Orchestration application. It is therefore critical to have separate MID Servers for Discovery, Orchestration, JDBC, and other activities to enforce proper credential usage.

  3. Security

    Review the MID Server Hardening Guide.

    There are a number of overlooked recommendations for setting specific security parameters, like disabling SSL and TLS 1.0/1.1, setting the DH group value, and encrypting proxy passwords (if applicable).

  4. Closing Thoughts

    Beyond initial setup, every administrator knows you have to keep current with upgrades, restarts, and credential rekeying. To take it a step further, you can also set up proactive monitoring, such as the built-in MID Server resource threshold alerts, or external tools like PerfMon, Microsoft SCOM, or Datadog.

Hope everyone learned something! I’ve been absent, but hopefully we can start off 2021 right and have a lot of content. Please comment your ideas or anything you want to see!

Monitoring Series: Research into ServiceNow Performance Dashboard

Below are some of my ramblings, thoughts, and research into the ‘servlet performance metrics’ dashboard from ServiceNow, and how this custom UI page really works.

[Screenshot: servlet performance.PNG]

Observation #1 - The Front End is in the Perf Dashboard Plugin

This dashboard is part of the core system plugin ‘Performance Dashboards’ (com.glide.performance_dashboards), an extremely lightweight plugin that just houses the shell UI and a couple of supporting scripts.

Observation #2 - The Library used is JRobin (Derived from RRDtool)

Within the scripts you can see data references to tables starting with jrobin, and those point to the JRobin Plugin (RRDS), which is a Java implementation of the RRDtool system. They even kept the Round Robin heritage by having the table labels start with ‘Round Robin’. It’s worth noting that you can’t see any of these jrobin tables OOB; they are locked down to maint. I had to go into each individual read ACL and add roles to be able to view them.

[Image: rrdtool.png]

Observation #3 - ServiceNow uses an RRDTool Database to Store Monitoring data

This leads us to another discovery: all this data is being parsed from a Round Robin Database (RRD), and there are supporting tables in ServiceNow which define the data refresh intervals (spoiler: they all refresh every 2 minutes) and information about the Round Robin Archive. I found a good introduction to RRD here.

Observation #4 - ServiceNow Undocumented Monitoring APIs

While snooping around in the client-side JavaScript, I found references to the APIs which are called to provide the data. The sys_ids needed to call these APIs are in the jrobin tables, and there are also other client-side parameters.

p = "/api/now/v1/performance_dashboards/data_categories",
f = "/api/now/v1/performance_dashboards/graphs/overlays",
m = "/api/now/v1/performance_dashboards/graphs/category/<dataCategoryId>",
g = "/api/now/v1/performance_dashboards/graphs/<graphSysId>",
y = "/api/now/v1/performance_dashboards/events",
v = "/api/now/v1/performance_dashboards/nodes",
_ = "api/now/v1/performance_dashboards/suggestions"
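As a starting point for testing them, here is a hedged sketch that calls the first endpoint from the browser console while logged in as admin; the expected parameters are undocumented, so it just dumps whatever JSON comes back:

    // Call one of the endpoints above with the current session.
    // GET requests generally ride on the session cookie; other verbs may
    // need the X-UserToken header.
    fetch('/api/now/v1/performance_dashboards/data_categories', {
        headers: { 'Accept': 'application/json' }
    }).then(function(response) {
        return response.json();
    }).then(function(data) {
        console.log(data);
    });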

Future Observations…

I would like to look more into the list of jrobin_graph_line records and understand how the aggregator relates to the data source (jrobin_datasource).

I want to do some testing and see what format and parameters are needed to use those Performance Dashboards APIs.

I find it interesting that rrd4j appears to be the more widely adopted Java port of RRDtool than JRobin. I could see ServiceNow eventually moving to it, if they don’t discontinue or entirely restructure their DB monitoring backend. But ServiceNow has stuck with JRobin since 2006, so I doubt it will change any time soon.

Deleting Fast in ServiceNow

Another post dedicated to the die-hard ServiceNow developers. Have you ever wondered how fast (or slow) ServiceNow’s deletion methods are? I’m talking about GlideRecord deleteRecord, GlideRecord deleteMultiple, and GlideMultipleDelete execute (undocumented). Today we’re going to rank all the methods and put some numbers behind the mystery.

It’s no surprise that deleteRecord does the worst due to the looping overhead. It’s surprising that deleteMultiple comes out on top of execute by a small margin of 74ms.

From personal experience, I’ve noticed that some sources are prioritized in terms of deletes, so here are the numbers again with the different sources overlaid. I ran tests with 1k, 10k, 50k, 100k, and 200k records to get a fair representation.

This is the summation of about 350k record deletes, and what method handled the varied delete batches.

Again from personal experience, I had a gut feeling that workflow activities are somehow optimized for slightly less deletion overhead, and that script actions are relatively slow.

Delete Multiple - The fastest method, by an order of magnitude

GlideMultipleDelete execute - The second fastest method (10x slower than deleteMultiple, but still very fast)

Delete Record - There is absolutely never a good reason to use this method, unless you just want to waste everyone’s time. It’s 1,270x slower than Delete Multiple; we’re talking a 1 s versus 24 min difference for 200k deletes.

Note: The numbers shown were all run on a personal developer instance with no other processes running, and I was running these delete methods with essentially no filters. Filters, as well as the size of the table you’re operating on, will dramatically slow your deletion rate. Your numbers could also vary based on node count, jobs running, active users, and database server load and capacity, as well as whatever else is running in your system (i.e., if your event log is overflowing, script actions will be slower).

Want to validate my results? Here is some sample code you can adapt.

var DeleteTester = {
    // Placeholder properties so the snippet runs standalone; point these
    // at your own throwaway table and test metadata.
    tbl: 'u_delete_test',        // hypothetical table seeded with test records
    size: 0,                     // how many records the current run deletes
    source: 'background_script', // label for where the test was launched from
    type: 'DeleteTester',        // log source passed to gs.log

    deleteUsingGRDelete: function() {
        // Slowest approach: iterate and issue one delete per record
        this.startTimer();
        var grTestData = new GlideRecord(this.tbl);
        grTestData.query();
        while (grTestData.next()) {
            grTestData.deleteRecord();
        }
        this.endTimer('GlideRecord deleteRecord');
    },
    deleteUsingGRDeleteMultiple: function() {
        // Bulk delete in a single call
        this.startTimer();
        var grTestData = new GlideRecord(this.tbl);
        grTestData.query();
        grTestData.deleteMultiple();
        this.endTimer('GlideRecord deleteMultiple');
    },
    deleteUsingGlideMultiple: function() {
        // Undocumented bulk delete API
        this.startTimer();
        var md = new GlideMultipleDelete(this.tbl);
        md.execute();
        this.endTimer('GlideMultipleDelete execute');
    },
    startTimer: function() {
        this.start = new Date().getTime();
    },
    endTimer: function(test_name) {
        var end = new Date().getTime();
        var dur = end - this.start;
        gs.log('Test ' + this.size + ' ' + test_name + ': ' + dur, this.type);
        // u_deletetesterresults is a custom results table created for these tests
        var grResults = new GlideRecord('u_deletetesterresults');
        grResults.initialize();
        grResults.u_record_count = this.size;
        grResults.u_test_name = test_name;
        grResults.u_source = this.source;
        grResults.u_execution_ms = dur;
        grResults.u_execution_time.setDateNumericValue(dur);
        grResults.insert();
    }
};
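To run a pass from a background script, seed a throwaway table with test records and call one of the methods (the table name and count here are placeholders):

    DeleteTester.tbl = 'u_delete_test'; // hypothetical test table
    DeleteTester.size = 10000;          // matches the number of seeded records
    DeleteTester.deleteUsingGRDeleteMultiple();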

Further areas of research to consider… If there is enough interest I may do a follow up post.

  • Historically ServiceNow has recommended the Table Cleaner (sys_auto_flush) for mass deletions. It would be worth measuring whether its deletion times are any better; a sketch for creating an entry follows this list.

  • Deletions on tables that are audited or have update sync enabled could be much slower and heavier. It would be interesting to measure the impact of adding those options.

  • Number of columns and column types on the table. I ran mine with the base system fields plus a string description field of size 80. If you added two string fields of size 4k, would it make a measurable impact on deletions? Also, does null data delete faster than filled columns, or is it treated the same?
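For reference, here is a hedged sketch for creating a Table Cleaner entry from a script. The tablename and age column names on sys_auto_flush are from memory, so verify them in your instance before relying on this:

    var flush = new GlideRecord('sys_auto_flush');
    flush.initialize();
    flush.tablename = 'u_deletetesterresults'; // hypothetical target table
    flush.age = 60 * 60 * 24 * 30; // records older than 30 days (age is in seconds)
    flush.insert();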

One final thought… The best rate I measured for large data deletions was about 152 deletes/ms, or about 152,000 deletes/s, on my ‘clean’ instance with a simple table and no filters. I would recommend running some deletion baselines in your instance to see what performance you can expect.