Incorrect API Optimization recommendations by Microsoft

Incorrect API Optimization recommendations by Microsoft

On the page in Microsoft docs where they discuss API Service Protections there is towards the end of the page a part which gives some recommendations. Some are great, like the recommendation to use many threads and remove the affinity cookie, however when I read it I really bounced at the recommendation that batching shouldn’t be used. That just didn’t rime with my experiences of doing heavy dataloads to dataverse. So I thought I might just test to see if it was true or not by creating a simple script in SSIS with Kingswaysoft. My results, using batching compared to not using it gives more than a 10x performance increase. Continue reading to understand more about how I tested this and some deeper analysis.

Parameters and excel

The first thing I did was to create an excel sheet for storing all the results. I really did have to think about the different parameters that could affect the result, so I chose the following columns:

Dataload – how many records. This needed to be a bit larger to make sure that the throttling time of 5 min was passed.

Operation – Different dataverse operation take different amounts of time. For instance, creates are typically rather fast, but deletes, depending on table, can be a lot slower as the platform might execute cascading deletes based on one single delete. For instance, if you remove a contact with 100 tasks connected to it with the regarding relation set to “parental” or “cascade delete” it will actually remove all the 100 tasks. If set to “remove link”, the platform has to make an update to each of the tasks, removing the link. There are also special operations like merge which are rather complex.

Table – There is a large difference between the different tables. Some of the OOB tables have a lot of built in logic and really small non-activity custom tables can be a lot quicker to create, update or delete.

Threads – How many threads were used.

Batch – The size of the batches being used.

Duration / Duration (ms) – Duration is where I input the duration as a normal time. I created a calculation to calculate the corresponding amount of milliseconds.

Time per record (ms) – This is the division of the duration in ms with the total number of records. During this first test, I always used 100 000 records as the dataload, but it could be interesting in the future to see the differences between different dataloads, with all else being the same. This is also the main output from this test.

Strategy – It is possible to have different strategies. In this first version I just ran everything at once, hence I called the strategy “All at once”. Different strategies might be “5 on, 5 off”, meaning that you design the script to run superfast for 5 minutes, the throttling limit, and then stop and do nothing for 5 minutes and then loop this. Not always possible to use that kind of strategy, but for massive deletes of for instance market list members (cannot be removed with bulk delete) that might be an option.

API – There are currently two APIs that can be used. The new WebAPI which uses JSON payloads and the older SOAP API which used XML payloads. It stands to reason that the smaller JSON payload should cause the WebAPI to be faster than the corresponding SOAP API. However SSL encryption also causes the data to be compressed, which might make these differences smaller than expected. There is also a server side aspect to this, as the APIs will run through different parts of the code on the server side which could affect the performance.

No of columns – How many columns are being sent to the API. Of course there would be a difference if you send a create message with 3 columns compared to 30. Hence this is a relevant point. It is still a bit rough, as there is a huge difference in creating a boolean record, a 2000 character nvarchar or a lookup. This could also be something that was adapted.

Existing records – How many records existed in the system prior to running this? Not sure if this makes any difference, in other words, everything else equal, would it take more time to write 100k records to a system with 0 records or one with 10M records? As I don’t know, and cannot rule it out, I added it.

Latency (ms) – Daniel Cai, Founder of Kingswaysoft, always recommend that the SSIS script with Kingswaysoft be run “as close as possible to the dataverse”. That does in other words imply that the latency to the server affect the performance. Do calculate this, I used diag.aspx from the computer running the script.

Location – Which geo is the instance located in. This is more for general information, the latency is really the important factor here. The throughput might also have some affect if you are using a really bad line to the dataverse. I was using a wired 1 GBit line. In this test, I was using an instance I got hold of as MVP, which is located in the US and my own stationary computer at home (a AMD Ryzen 9 3900X 12-Core Processor 3.79 GHz with 32 GB of memory). Hence the latency was rather high and not in line with Daniel Cai’s recommendations. It is hence also something to investigate further.

No of users – As I, and some others in the community have described, throttling is based on a per-user and per-front end server basis. Hence utilizing several service principals/application users can effectivly multiply the throughput. In this test I used just one.

Instance type – It is well known that sandbox instances do not have the same performance as a production instance. If you find Microsoft support on a happy day and you are working with a larger (no of licenses) instance, you might also get them to relax the throttles a bit, especially if you mention that you are doing a migration. As these factors strongly affect the performance of large dataloads, I did have to add this. During this test I was using a non-enhanced production instance, in other words, a production instance on which no throttles had been relaxed.

DB Version – The final parameter that I thought might affect this is the actual version of the dataverse instance. As improvements and god forbidd sub optimal “improvements”, can cause enhancements or degradations of the performance, this is necessary to document.

SSIS/Kingswaysoft setup

For setup of create tests in SSIS with the Kingswaysoft addons I used a dataspawner (productivity pack) to generate the data. I then just sent this directly to the CDS Destination.

And the Data Spawner config:

And the CDS Destination config

The main changes done in this case were to the parameters “Batch size” (set to 20 in the picture above) and how many threads to use (also set to 20 in the picture above).

After each run, I checked the log from SSIS to see how long the entire process took. Due to the fact that I have a computer with many threads and for this case, enough memory, it is my perception that most of the threads allocated were also used.

Results

What are the results? This is a picture of the excel:

As you can see I did try both Create and Delete operations, but the results are rather obvious.

  • 20 threads/20 per batch of both create and delete, took around 45 minutes
  • Reducing to 16/10 made only a minor difference – 48 minutes
  • Microsofts recommendation of not using batching, ie 20 threads/1 batch – took over 10 h, both for delete and create.
  • Using only 1 thread with 1 batch was more or less the same as using 20/1
  • 1 thread with 20 in every batch (1/20) took almost 5 h, which is around half the 1/1 or 20/1

I think the results clearly show, that Microsoft docs are currently incorrect in their recommendation to not use batching. Perhaps they will update this soon. From an entitlement perspective, one needs to understand the additional cost of the “batch unpacking” request that is made. With 20 in every batch, this is an overhead of 1/21 but if you would lower the batches to 4, it would be 1/5. Hence using as large batch as you can, without loosing performance, is generally what I would recommend.

As I have implied in this article, there are a lot of other parameters to investigate in the API. I have a hunch that a create with 10 lookups compared to 10 textfields, will also make a significant difference, but I will need to test it.

Also do consider the request timeout. When working with complex and large batches, one request may taker quite some time. You will know, however, as it will return a timeout exception if you exceed it. Note that some records in that batch may have been written anyway. Just that your client wasn’t waiting around for the answer.

I do also encourage others to try out other parameters in the API. What is really optimal from many different aspects. From a mathematical perspective this can really be seen as a multidimentional surface where we are attempting to find the highest points. I have now started this journey, and I hope it was an interesting read. Please leave a comment, if you have any experience to share or just want to comment.

Impersonation – the Entitled future

Impersonation – the Entitled future

As I mentioned in my previous articles, I am trying to investigate the details of how the entitlements and API Service Protections are working and are planning to be rolled out (in the case of entitlements). I had a very interesting call with some of the nice people in the product team last which shed some more light on the entitlement issue and the best practice of how they suggest the API is to be used. The suggested method is that the API request load be spread out over the different users in the instance/tenant using impersonation. I will walk through what this means and what I think about this in the article below.

First, if you have not read my previous post on entitlement, I do suggest you do this first. It describes what entitlements are compared to the API Service Protection. I still see a lot of people mixing these up and that is not strange, but they are two different aspects of this, and we need to keep track of what we are talking about.

As mentioned in that article, the point of the enacting the Entitlements, when that is coming, which still is a bit unclear, is so that the compute consumed by a small organization is proportionate compared to a large organization. So, let us go back to the actual per-user licenses and have a look at an example.

Let us say we have a 5 000 Sales Enterprise org, that means that we get:

  • 5 000 users who each have 20 000 API request entitlements.
  • 100 000 API Requests for non-licensed users.

Compare this to a 10 Sales Enterprise org which will have.

  • 5 users who each have 20 000 API request entitlements.
  • 100 000 API Requests for non-licensed users

Both these are totally independent of how many instances the first or the second org has.

The first observation is of course that the 100k API Request for non-licensed users do not scale at all with the size of the organization or the number of users. How does this then go in-line with the goal that a large org should have more compute than a small? The second observation is that 20 000 API requests, which actual also the normal UI will be using, is very large. You would have to be one busy salesperson to be able to generate 20 000 API requests manually in 24 hours, so busy I am tempted to say it is virtually impossible to break unless you have very heavy automations running under your account. This was also what the Microsoft rep I talked to mentioned, that this large number is to be used on a per user basis. Hence the natural question was, if we use impersonation in the API, will the Entitlements honor that? The answer was unequivocally: yes.

Hence, this is the clear answer on how we need to create future integrations. We need to spread the load using impersonation over many of the users in the system.

Microsoft docs on how to do impersonation

If we do this the right way, it would probably be possible for most organizations to, over time be able to build a fix for this.

However, it will not be easy as we need to have a tight control of the privileges of all the users. Let me give you an example from a customer I work with:

They are an online travel agency and have people working at the destinations with very restricted privileges. A lot of bookings (orders) are integrated from the booking systems, these should hence be spread out over many users instead of the single application user being used today. There is not natural user to direct the bookings to, as it is a B2C business, and no person at the travel agency “owns” these customers per se, so the load needs to be distributed in a more randomized fashion. So, let us say we have these users:

  1. John Smith – System Admin (Full access)
  2. John Doe – Power User (can create orders but not refunds)
  3. John Surf Dude – Destination Specialist (can view but not create orders, cannot even read refunds)

When rebuilding the integration, we can use user John Smith and John Doe but not John Surf Dude and the only way of generically knowing this is checking what we want to do and comparing this to the privileges of each user to get a shortlist of users that can be used for integration.

However, we do not want to use a user that is close to 20k API requests for that day, so we might need to query the current API Request entitlement usage per user, so that we can filter the current shortlist to an even shorter list before knowing which users to use for impersonation.

Analysis

  • A way forward. I think this can be used, although there are some tricks to it. For my customer we might be able to cut a significant amount of API calls this way which will make a huge difference when we compared to not using this technique.
  • Impersonation not always viable – as in the example above, when there is not obvious owner to link to, we need to figure some other logic out of how to spread the API entitlement load. And things start to become tricky.
  • Impersonation is complex.
    • More complex dependencies on security model
      As mentioned above, trying to execute an action as a user that does not have the correct privileges won’t work, so we need to know that first. And setting everyone as System Administrator just will not work.
    • Logical user or just random users – trying to map the users to some logical connection from the other system or just randomizing the load. Logical user is probably preferrable but probably will not be a very common pattern.
  • Integration often system-to-system not user-to-user
    • Integrations are more often done on a system-to-system basis, not user-to-user basis. When looking at CRM-ERP integrations for instance, the user base of these two systems seldom overlaps except for a few users.
  • Takes time to refactor code to handle impersonation – There are many organizations out there with numerous complex integrations. And changing integrations on this level will require significant work to be done and the question will be if there is time to complete this work before the entitlement feature goes to GA?
  • Strange audit trail – if we use randomized users to update or create data in dataverse that will undoubtedly create very strange audit trails, created by and modified by fields. These are some facts that need to be taken into consideration.
  • Power App – per App users have very few requests – Not all licenses have 20k API requests per 24h. The Power App per App has only 1000 API Request entitlement per 24h, these can run out just by a using the system heavily. So do consider the API Entitlements when looking at the licenses.
  • Still not GA – Entitlements have still not gone GA. Hence the best time to let Microsoft know what you think is good or bad about this is now. But do be civil, there will be some feature like this, that will handle fairness management of compute consumption. Contact Microsoft through your local User Group, your local MVP or via the comment below or send me a message on LinkedIn and I will put you in contact with the right people. You can also submit an idea to the idea portal.
  • There might be a point to binding all entitlements to users, in the case that if, in the future, any overshooting would not only result in angry emails, but service degradation or shut-off for that user. Imagine having creative citizen devs creating some infinitive looping Flow or massively recursive logic unknowingly which causes a lot of requests. This approach would then just cause a block for that user, not the entire tenant. Significantly reducing the severity of the problem.

Suggestions

Personally, I think this method is just way to complex. I think just having a simple pooling on the tenant level of all the API entitlements would be fair and then deducing all usage from this. I think that Microsoft could skip the 100 000 for the non-licensed user, for simplicity. Based on the examples above, that would make:

5000 Sales Enterprise

  •  5 000 users who each have 20 000 API request entitlements.

Total API Entitlement for the Tennant: 100 M / 24 h

5 Sales Enterprise users

  • 5 users who each have 20 000 API request entitlements.

Total API Entitlement for the Tennant: 100 K / 24 h

And all users, and all non-licensed users use from the same pool.

As for the potential problem of creative users potentially blocking the entire tenant, I would suggest adding a “per user” API request limit, which can be changed by the admins, but by default is set at exactly the same as the entitlements. That would allow admins to reduce the limit to 10k for enterprise users, to ensure the server-to-server integrations were still enabled in a proper and entitled way.

I think this would align with Microsoft’s goals and make it easy to understand for customers and we do not have to rewrite tons of code and make strange workarounds. But maybe there is something I am missing. If so, and you see it, please leave a comment! 

Why I love Kingswaysoft

Why I love Kingswaysoft

The reason I often recommend Kingswaysoft over other methods of data migration or even sometime data integration is rather simple, it can do what others can’t. Not even Microsoft. And it can do it fast.

So have I been payed by Kingswaysoft to write this? No, not a penny, I havn’t even been given a free license even though they might if I asked. I think it is just fair that I explain why I am such a strong proponent for this product and if anyone disagrees, please feel free to drop a comment below.

API knowledge

Kingswaysoft both know how the API:s of the Power Platform/CDS/Dataf??x (called CDS below) work, sometimes even better than Microsoft themselves as I have seen feedbacks given in such detail from them on what is missing from specific API:s to reach feature parity, that it is scary. Kingswaysoft also has a blog which has details recommendations and built in recommendations in the product on how to maximize performance with the Dataflex API. They have also built in handling for throttling, multi-threading, batching and more to just make it transparent. When looking at other integration products and connectors. I have never seen anything that is close to this depth.

Datamodel and Dynamics knowledge

Dynamics 365 Sales, Customer Service and other first party apps have some very peculiar oddities that need to be taken into consideration. These oddities range from for example:

  • How to delete Marketing List members – by using the party member and list as key
  • How audit logs are handled
  • How Activities work, with activity pointer and activity parties
  • Setting CreatedOn/CreatedBy/ModifiedOn/ModifiedBy
  • Reading/Writing personal views (saved advanced finds)
  • Setting statusvalues of some fringe entities

Even though many tools, like Power Automate, Data Flows, Azure Data Factory, etc. all seem to handle common tasks like creating or reading a contact or account rather well, it soon becomes a problem when you start looking at some of the areas above. And you often need to in a migration.

It can use the power of SSIS

SQL Server Integration Services is a very powerful framework once you start getting your head around it. Yes, it has some quirks to it but in general it can do some rather cool things and is good when you want to sequence many different tasks in order to reduce overall runtime. Using built in features like Cache Transforms with memory storage, you can make filters using memory based lookups with millions of records super fast, once everything is loaded.

…and extend it

And of course Kingswaysoft didn’t settle for just building connectors to Dynamics 365. They have a pack called “Productivity Pack” that adds a lot of nice features to SSIS that makes your day a lot easier.

And if you have problems – great support

And if you ever run into problems, their support is great. We have identified buggs and they have fixed the bug and sent us a special deploy just a day or two after.

Flip side

There is always a flip side. I think one is that SSIS isn’t always your most stable product. For this I don’t think Kingswaysoft are to blame but it affects their product experience none the less.

Another obvious flip side is that this requires a skillset that is a bit “off the tracks” even though it isn’t super hard to learn the basics, becoming really proficient with SSIS takes time.

Finally

So, a product as good as this is probably super expensive, right? No. It isn’t. If you are running things from within Visual Studio, then you don’t even need a license, but if you plan to run a migration or so, I certainly recommend one anyway, to get access to support. And buy the Ultimate license to get access to the productivity pack and all the other connectors while you’re at it. It is worth it.

Just to be clear, I am not saying that the other products are bad. I am just saying that once you choose a product, you will start to decend the rabitts hole. If the product isn’t up to speed you will have a couple of options:

  • Back up and redo with another technology
  • Patch with another technology
  • Create some kind of workaround

Neither of these solutions is rather palatable. Hence it is often tempting to choose a product that you know can do the job. And apart from coding, SSIS with Kingswaysoft will very seldom have any issues.

Team member licensing – hammer coming down!

Team member licensing – hammer coming down!

The team member licensing option is something that has been a subject of debate for quite some time. What can it actually be used for and what not? I have heard “experts” suggesting that it be used for integrations but if your read the Dynamics 365 Licensing Guide, appendix A, you can read all about what Team member licensing is and isn’t. Typical scenarios for Team members are:

  • Read only users
  • Users that use only slim parts of the system, and not the “cool” First-Party-App features in Sales, Customer Service, Field Service etc.
  • Users that just track activities

There are some updates to the details of what a team member can and cannot do, and I think the most important is that a team member cannot CUD (Create, Update or Delete) accounts any more.

How urgent is this? Well, for new instances being created, this will be enforced as of April 1 2020. But for existing instances, it will be enforced on July 1 2020. Hence, you still have some time if you have an exiting org.  

You can opt-in to early access updates. Goes without saying that you shouldn’t do in your production system, and probably not even in you dev/test environments since as soon as you have you won’t be able to depoy changes. So, having an out-of-ALM-environment where you can test this might be a good idea. 

So, the BIG question is, “Are we compliant?” 

Well, there is actually a report that you can generate from the Power Platform Admin Center and I have recorded a video below that shows how you can use that and collate the data with pivoting in Excel.

Top Table Usage in PPAC

Top Table Usage in PPAC

Top Ten Table usage is back from being lost when Organizational Insights was discontinued. A bit tricky to find so check out the video. Awesome tool when trying to reduce the size of large instances, especially important now that the prices per GB are going up to $40/GB (subject to you license agreement).