I have recently been working with a customer with a large (500+ GB dataverse db) production instance and we are attempting to cut this down quite a lot. The natural way of doing this is bulk delete but it seems that if the underlying SQL isn’t up to speed, your jobs might end without actually being done.
My customer has a production instance of over 500GB which is costing them some substantial money and hence we started looking at removing some of the less necessary data. In this case it was the order products (salesorderdetail) which we have perceived that we do not need more than one year after delivery. Hence we set up an advanced find, agreed on the exact filters, saved the view and then tried to remove the records (well over 5M) using bulk delete. I restarted it several times. You can see the chronological results in the screenshot below:
As you can see, sometimes it actually deleted a few records, but most times, it didn’t. My professional analysis of this (=guess) is that this is caused by bulk delete not handling exceptions like SQL Timeout properly or that it has a limit on the number of times it will retry.
I also, in parallell tried to remove the same records with the same FetchXML using SSIS/Kingswaysoft and here I have several times gotten this error. I have had to turn down the knobs to a very low setting to get it to work, but the error message I did get was:
(Status Reason: ServiceUnavailable): The remote server returned an error: (503) Server Unavailable.”.
As you can see, there is a lot of noise but it clearly, in the highlighted part, say that there is a SQL Timeout.
Hence my takeaway from this is that you need to be a bit wary of bulk delete in large instances or in general as it might indicate that it has completed successfully but in fact it stopped due to SQL Timeout (or some other platform related issue).
On a personal note, I really hope the bulk delete functionality gets a modernized revamp soon. It is really old and is becoming more and more relevant.
Storing data in dataverse is very expensive. Especially the data that is stored in the actual database (db data). Hence, for many customer with larger datasets, typically with some B2C type of business, it is a good practice to overviewing the data you have in the system and figure out different ways of keeping it from costing too much money.
Below are 10 tips you should consider to keep the data in check.
Make a deletion list – and set up deletion jobs
Flatten verbose fields
Use virtual tables
Use datalake with analytics
Use datalake for archiving
Compress complex structures
Clean up non-production environments
Set lifecycle for instances
1 – Make a deletion list and set up deletion jobs
The first thing you need to start doing is to go through the biggest storage contributors. You can find which these are by looking in the Power Platform Admin Center under “Resources -> Capacity”. Open the view of each of the dataverse instances by clicking on the small chart symbol next to the instance.
You can export the full list by clicking on the hamburger symbol in the right hand corner of the graph. See picture below:
You can then start by trying to break down each table from the top. Based on the picture above, ask yourself questions like;
Do we need to store and save all orders? By removing some orders, maybe cancelled orders, we can cut down on two of the three largest tables as the order rows are removed at the same time
What activities are really causing activitypointer (the common table for all activities) to become so large? It is quite common that the email table is a culprit here, as the body of the emails, is usually a quite a few bytes and with thousands or tens of thousands of emails, they do add up. Marketing integrations can also bloat the activitiy pointer. I suggest investigating this further by exporting the data to an Azure Datalake and analyzing it with PowerBI. It is also quite common that not all emails need to be saved, for instance if you have email enabled queues like email@example.com it is quite common that you get some spam into this. Maybe searching for “unsubscribe” in email body to see if there might be some newsletter looking spam which enable you to remove them.
Are there any patterns of contacts and/or leads that can be removed? Working with B2C you might find that there are quite a lot of contacts and leads with incorrect data or similar. Ask yourself what value a lead with no phone number and no email has… removing contact data will also remove customeraddress as there will be at least two customer address records for each contact automatically created. The same goes for leads but in that case it is the leadaddress where extra data is being created.
The point with the list is trying to identify possible patterns of data that can be deleted. I typically have one row per rule like so:
status code = inactive AND modifiedon older than 6m
body contains “unsubscribe” AND modifiedon older than 1m
Flatten w Power Automate
Outgoing AND subject = “Covid information”
2 – Flatten verbose fields
Sometime it can be a good idea to just clean out or some “body” data. For example if you have sent emails to a lot of customers with the same Covid related information, the actual content of that email is the same for all and is known by everyone. Hence the body can be removed which can typically be done with a Power Automate Flow or SSIS/Kingswaysoft depending on the size of the data.
3 – Use virtual tables
Not all data needs to be stored in the expensive dataverse database. The recommendation from Microsoft is typically that data which you interact with should be in dataverse and the rest can be somewhere else. As it is now possible to use SQL as a source for virtual tables, as well as Cosmos DB and many other sources, moving data out of dataverse and accessing it through virtual tables can be a viable option.
Virtual tables is also something that should be considered when designing integrations. It is not always critical that all data actually reside in dataverse. Sometime just being able to access the data through dataverse is good enough. Especially if the data source is fast, it isn’t that much data, the data only needs to be accessed seldomly, It is also easier to use virtual tables if the data is read-only and used for reference.
4 – Use datalake for analytics
Dataverse isn’t really a good source for analytics. All endpoints (even the T-SQL) go through the application layer to the database. This makes large data crunching less than optimal in dataverse. This is also why Microsoft have developed methods for replicating the dataverse data to external stores. The just discontinued Data Export Service (DES) which synchronized data to an Azure SQL and the new Azure Synapse Link which synchronizes data to a datalake, are clear examples of this.
Hence, a good architecture for analytics of dataverse data is doing the analytics outside dataverse and the current best place for this is in a datalake with Azure Synapse Analytics.
This also has a direct implication on the data stored in Dataverse, the data you have in dataverse that is only needed for analytics, might not actually be needed in dataverse anymore. This can be anything from old orders, and of course all old communication like emails and phone calls. Datalakes are also way more efficient for doing large scale AI/ML analytics, which many organizations are looking for.
5 – Use datalake for archiving
If you havn’t already thought about it, a datalake can hence be used as an archive for data that you might need but isn’t actually something that you will use every day. Often a lot of data is stored for compliance reasons. When setting up archiving, it is important to make sure that what ever method you use, you probably want to keep the data in the archive after it has been deleted in dataverse, hence you might need to move the data from the initial storage container to the actual archive making sure to not actually move any deletes.
Microsoft have announced that they will be putting their own hot/cold-storage solution for dataverse into public preview this spring. This sounds a lot like some archiving functionality and I am certainly looking forward to seeing what it will be, how it will work and the licensing for it.
6 – Compress complex structures
For some businesses it is often needed to have quite a complex datastructure to handle the operations of what is done. It can be complex order structures with allocations of order rows to specific users or product configurations with billing, logistics and provisioning settings. Once an order has bee fullfilled in all aspects, it might not be required to store and keep all that complexity. It might just be required to keep the order header and the most critical information about what the order rows were about. Perhaps even flattened into a JSON or text field for future reference. If it is possible to move from a order header + 10 records in related tables to just a order header with a summarized text, for thousands of orders, than can save quite a lot of space.
7 – Clean up non-production environments
Data isn’t only stored in the production instances of dataverse. Many times production instances are copied and turned into test or staging instances. Integrations may be running towards development and test instances generating substantial amounts of data. Have a look at your non-production instances and check which data you actually need in them.
8 – Revise datamodel
Sometimes datamodels can become unnecessarily complex. This is most common when someone with little experience of dataverse and model driven applications design solutions. Other typical problems I have seen are repurposing of tables like leads, accounts, sales order to other simpler purposes. The problem from a dataverse storage perspective becomes that the additional storage overhead in these cases, especially in cases with businesses that have a lot of customers, can be rather drastic. Redesigning the datamodel to be more sleek can make it both more user friendly and use less storage.
9 – Revise integrations
Integrating data to dataverse can be a large culprit for data. For instance, many marketing applications generate quite a lot of data regarding the behaviors of customers. This data can then be integrated to dataverse to be used there. As behavioural data can be very granular, to the level of “Customer x has be send email y”, “customer x has read email y”, “customer x has clicked link z in email y” this data can take up substantial amounts of storage if integrated to dataverse. What options exist for this are different for different marketing applications. For ex Adobe Marketing can shut off this integration and can instead synchronize behavioral data to an Azure datalake where it can be unified with other dataverse data.
ERP integrations are also a common area of problem. ERP systems often have a very deep granualar level of data which might not be needed in dataverse/Dynamics 365 CE. Sometime just enabling deeplinking directly to the ERP system can be a better method, combined with virtual tables. However, this is not always the case and sometimes the best solution is just synchornization of the data.
However, do recognize that there should be a reason for having the data in dataverse. It should genrally be actively used.
10 – Set lifecycle for instances
Having dataverse instances with a bit unclear usage and not knowing what they are for and if they can be removed is, of course, a source of quite a lot of data consumption. By clearly assigning owners, reasons etc for all instances it makes answering the question “Do we really need ‘Internal test applications sprint 7′” or is it just an old remnant from a two year old development. This is part of the Center of Excellence starter kit and the processes that are important to set up in relation to this.
I hope any of these 10 tips have been useful. If you think so, the best way to thank me is to tweet or share your thoughts on this in social media. Also feel free to leave a comment below.
Perhaps you have some other thing that you think should be done or that I missed something. If so, please share in the comments below.
I was recently in charge of a large migration. It all went fine but not without hickups that typically are connected to moving large amounts of data to dataverse. We were using SSIS with Kingswaysoft and ended up using a local SQL database as staging database too. This article will discuss the different lessons learned and give some concrete tips when doing similar migrations.
One of my more popular articles is the article that describes how to optimize the writing of data to Dataverse/CDS. If you are working with migration of large amounts of data, as I will be describing here, I do suggest you have a look at it: https://powerplatform.se/fast-data-management-in-a-limited-cds-world/. I will not discuss those concepts in any detail here but we did use all aspects mentioned in that article.
I recently was in charge of a migration which used CSV-file exports from an old German system (with German field names!) which had many millions of records, in both large tables like “Contact” and “Sales Order”. However, the system we migrated from had a completely different data modell than the one used in Dynamics. For instance, each row describing a “Flight” had to be divided into two rows, one for the outgoing flight and one for the homecoming flight, in the order detail table. We also had to create a lot of related data which was referenced from the “Flight” table, for example location, agent and brand. In other words, there was quite a lot of heavy transformations going on and a lot of logic involved, such as change format on the old data to match the Dataverse model and apply rules to resolve old issues, such as bugs.
Initially we only got a quite small subset of the entire database load, and we started our migration journey by creating all the migration logic in SSIS (which facilitates the script and makes updates easy to handle). The script did include some functions that “joined” rather large tables, both from the CSV files but also related data fetched from the Dataverse based on primary and alternate keys. I was clear with the customer from the very beginning that I wanted a full export with the same amount of data that we could expect in the final migration, mainly for the opportunity to stress-test the SSIS script before the migration to the production environment took place and after a while we got the big files…
…And this was when the excrement hit the wind generator. The afore mentioned lookups just stalled forever. We noted that having a lookup (using Kingswaysoft Premium Lookup) works fine on a computer with 16 GB memory up to a few 100k of records. However, once the data starts reaching 500k and more, it just stalls forever (and don’t even get me start on the sort tools…). Not sure exactly if it would have been possible to fix this by adding more cores and memory, we didn’t try. We hence had to rewrite the script and implement a staging database instead. What we found, is that a dataflow with 1M+ records of lookups will be 100x faster if you import the data into SQL and do a join instead. Lookups still works for smaller tables and I am not against them per se, as they do make the migration simpler. Adding more tables to a migration database will increase complexity, and if you want to add a column in a table, that table do not only has to be added to one SSIS dataflow, but probably a few more. And you also must do an ALTER TABLE in SQL to add the field there too. It is therefore important to have a good mapping set before you start to create the script. And keep the complexity as simple as possible. You can also use SQL tasks in the migration script to update the tables straight after you read them to the staging database per automation, if you need to apply some kind of rules after the read to the staging database, and find it easiest with an SQL query.
The method we used for developing the migration was to first make a “skeleton” migration, based on the target data model. In other words, we started with trying to get a few of the easiest fields, not all, from all tables that was to be involved in the migration – maybe it could be called – model-first-approach, instead of starting with one table, completing this and then moving on to the next. The advantage of the model-first-approach is that you quite early can start some tests on the data, for instance setting up some quantitative test by checking in the source system for the quantities of contacts and then comparing these quantities to the target. The tests can typically be done by other people than the people building the migration scripts and hence this methods scales a lot better than table-by-table-approach. It is also possible for several devs to work in parallell with different tasks. Typically the more senior will build the skeleton and then more junior can add fields by field to each respective table. A negative aspect of this approach is that it requires a lot of re-loads (keep in mind that this was a first migration, so there are no prior data in the Dataverse that we needed to consider) and re-mapping. And it may be easier to “fall out of” the structure, if you just need “to add a little bit here and there”. It is however indeed hard to go table for table, especially with related data. If you already have a lot of live data, you should think about a way to easy identify the migrated data so you can bulk-deleted. And do not forget to engage the client early with raised questions and the mapping to make sure you have understood everything correctly and avoid unnecessary errors.
We also tried to create unique row identities that strictly was based on the source data. This is very useful as that allows for delta-migration, or to continue where we left off in case of a problem. Let’s say for instance that you want to migrate 3 million contacts. If, after 2.1 Million contacts the script breaks for some reason, it is good to be able to continue at 2.1M instead of restarting. In this case we didn’t use modifiedon-date to be able to do a full delta migration logic but it is certainly possible. For this we used the cache-transforms, easily fetch the already migrated data (if any) with the unique and sort out the already migrated data if it matched the key.
Another pattern that we used was that, after creating a specific record, like contact, we reimported the recordid (in this case contactid) together with the legacyid. This allowed us to directly join with this table when later adding tables with dependencies like lookups towards the contact table, could be joined with this mapping table so that we directly got the contactid when querying the related table.
When migrating from CSV, import them directly as source tables in the staging database. That way, in case you need to fix something, you have a good reference for quantities.
Get an example of the full data load as early as possible. A script that works for a subset might not work at all for the full dataload as was the case for us.
Automate as much as possible. Don’t use any hardcoded values that are environent specific, such as transactioncurrencyid, but rather read these to small tables or to SSIS variables. Use SQL Truncate to remove all data quickly in a table, and make this part of the SSIS script as an SQL task at the appropriate stage.
Always check the quantities. How many rows in source data, how many rows after a match and check if it differs so you very early can identify bugs in your script that might be the reason for dropping rows. For example, you might use a JOIN when you should use an OUTER JOIN. Always check the total number and see if it is what you expect. Watch out for duplicates, and always check so your unique IDs (if you got some from the source data) really are unique and not NULL. Do note that if you have duplicates, that you join on, that will create multiplications. Hence it is possible, after a select-statement with joins to get more records that the initial table.
Define reasonable goals and test cases for the migration. Some examples:
99.9% of all contacts to be migrated correctly. With 1 M records, this means that anything lower than 1000 incorrect migrated contacts/missed, is defined as still ok.
Randomly pick 10-20 records on a base level, like 20 customer, and then compare these in the UAT/Test environment to the source system, as it is seen there. This needs to be done by the business people, so that they can have a say if the migrated data is fine.
Select some filters, like “all customers in Munich” and some other segmentations and compare source system to destination. If there are large amounts of errors, backtrack to the staging database to see where you did loose some records or created too many (not uncommon).
Complete entire transformation to destination tables in the staging db. Then you can move directly from there to dataverse. This is particularly important when moving large quantities of data when managing the data in SSIS can be problematic.
Make sure to have unique identifiers on all tables that preferably can be regenerated from the data. Store these in some “Legacy ID” field. This allows for delta-migration logic, ie. where part of the data is migrated and then the rest later. If you have some issues during one of the dataflows, and it stops on 3 230 234-th record of 6 M, you can continue from there and you don’t have to redo it all. If there is no decent way of getting a legacy id, you can generate classic row numbers by creating an identity column. This will make the migration utilize this, but only within that particular instance and load of the staging db. Hence you must be careful everytime you reload the database.
Utilize the backup-restore functionality of the dataverse environments. Do note that you can make manual backups just before you start migration. If you have a production environment, this will need to be converted to a sandbox environment before you can restore to it. Another option I got from a colleague was to use 3 different environments, with temporary names, and then just rename the final one when done.
Once you have transfered an entire table to the source system, it is typically very useful to have a mapping table, with just the table record id and the legacy id. So for instance, after migrating Contact, read all contacts from dataverse with the contactid and the legacy id. That way, when later migrating “salesorders”, which identify the customer by legacy id, it is easy to just join with this table to get the contactid.
Production environments are faster. Fastest is to ask Microsoft Support to relax throttles on all environments that are used during migration.
Use a VM that is located geographically (or really with low latency and high throughput) to where the environments are hosted. This is a very common recommendation by Kingswaysoft too.
The settings for number of threads and batch size needs to be set based on some factors, namely:
Have throttles been relaxed
Size of payload (ie how many columns) – larger payload -> smaller batches.
Type of action – creates are faster than deletes. Updates are in-between.
I hope these tips can help you along. If you have any comments or you have other experiences in this subject, don’t hesitate to leave a comment.
During this migration and the writing of this article, I had excellent help from my highly intelligent colleague Ebba Linnea Nilsson and it is certainly true that two heads are better than one, and the end result is often a lot better than just the sum of two people. So for my final recommendation, make sure to have a good colleague with you to help you out, as you most probably will run into some issues and having someone to discuss with is really great!
“Do you really need to delete records like a Ferrari?” – that question was posed to me when I, a few years ago complained about the bad performance of the Bulk Deletion functionality in Power Platform (at that time Dynamics 365 Online) to a friend at Microsoft who I will not name. And my simple answer is yes, we do need to delete records like a Ferrari, for many reasons. I will discuss why in this article and I have for that reason also created an Idea on the Power Apps Community site on this subject and I hope that you agree with me and vote for it! You will find it on the link below.
So, why is a fast bulk deletion important. I would say there are several reasons and I will walk through the ones that I have thought of, if you have any other, please drop a comment.
Keep your data in check – remove unnecessary data
GDPR and other compliancy and legal issue
Power Platform growing into Citizen developer platform
Entitlements effectivly blocks using external tools
Keeping data in check
For larger organizations, especially with many integrated modules and systems, many running Flows, workflows, Customer Voice surveys etc. the system will generate a lot of data, especially if it is a B2C scenario. A few of these have built in features that automatically remove old logs etc but most don’t and we as admins and system caretakers (isn’t it a fancy title!) need to tend to this, typically by setting up jobs that clean old data. I would recommend looking at the PPAC statistics of which tables are the largest and having a practice of doing so at regular intervals and downloading it. That way you can see the trends over time. A suggestion for an addon to the CoE Starter kit would be a trend analysis of all tables with growth numbers per week for each tables with warnings for quickly growing tables and prognosis.
However, as instances start growing over 50-100 GB in size (of structured data) it soon becomes too large to handle the data with bulk deletion. Some tables might still be managable this way, but in general the performance is has is, when I have tried to measure it (albeit a few years ago) was around 1-3 records per second. A customer I have, working with B2C for whom I wanted to remove their Voice of the Customer, which had been used a lot, had over 50 Million Survey Invites. It is not possible to remove the solution without first removing the data, and if we were to use Bulk Delete and put it on crack and it got to 10 records per second, it would still take around 2 months. I now did it with SSIS/Kingswaysoft and it took a few days. If Bulk Delete could reach around 200 records/second, it would take a little less than 3 days.
I have also noted that when trying to Bulk Delete very large datasets, Bulk Delete simply fails, as I think the FetchXML query might do a SQL Timeout or something like that. Not exactly sure what happens. As it works with Kingswaysoft I don’t know what might be the difference.
GDPR and other compliancy and legal issues
As GDPR and other similar compliancy regulations have come into play in many countries around the world, it has become ever more important to stricly follow these detailed instructions. These might be simple when you look at them on a Power Point C-level perspective but when you dig down on the detailed level, where they actually need to be implemented, things seldom are as simple as in a Power Point.
Power Platform growing into Citizen developer platform
As the Power Platform grows from being just a platform on which Dynamics 365 is delivered to being a huge platform for digitalization entire organizations with almost 100% user saturation will be coming starting to use Dataverse. The amount of data being stored in dataverse will hence grow to massive amounts and hence an effective tool to manage this data is most important. It is probably even important to such a level that Bulk Delete cannot even scratch the top of the iceberg of what we need to be able to do on a data management perspective as data will be growing and expanding in heaps and bounds and admins will not only need to manage Flows and Apps but also data in size and content.
Entitlements effectivly blocks using external tools
The soon to enacted entitlements, as mentioned in my previous post, Entitlements are not throttling | Powerplatform.se, also effectivly stop the use of external tools like SSIS/Kingswaysoft for deleting unwanted data. One of the customers I am working with generate between 10-20 M API requests PER DAY, and the bulk of these are from deletion jobs or other maintainance jobs trying to keep track of the instances. With the new entitlements charge, there is no way this can be continued, but the customer is cought between a rock and a hard place as either the data grows by leaps and bounds or the API calls becomes a huge cost and there is no easy way to handle it. What advise am I to give the customer? I would think that the most reasonable thing would be if the platform made the tools available to maintain the data to avoid the costs. If this is using bulk delete or some other more elaborate feature, that is up to the product team but I do think they should hold off on activating the entitlements until there is a good alternative for managing an instance data within the platform before this (not generating API requests).
What else is missing?
Bulk deletion is not only not being performant enough, it also lacks the effective filtering logic that is required for more complex queries. For some customers a I have had to construct rather elaborate SSIS scripts which start with a complex FetchXML and the filter the data through several Cache Transforms, for instance with GDPR consents and similar to be able to get the final list. I must admit that I havn’t tried using the new T-SQL connector for this, that it could handle the full T-SQL complexity and that it is implemented in Buld Delete or Kingswaysoft as a means to make querying more powerful.
Microsoft recently (in February) published some updates to their documentation regarding Service protection API limits or as they are sometimes referred to, throttling. Some of these, like the new recommendations on how to handle batching are rather interesting and I thought I’d give my 2 cents about this. They are also eluding a bit regarding how the network infrastructure is set up for the deployment and how to optimize when handling larger workloads using the affinity cookie setting. I did find this rather interesting too.
In short, within a 5 minute sliding window, you cannot exceed the following for one specific user.
Not more than 6 000 requests.
Not more than 1 200 (20 minutes) execution time – equal to 4 parallell processes if running at full capacity
Not more than 52 request at the same time (concurrently).
Generally, if you do not use batching, you would typically run into the first (1) or the third if running unlimited threading. If using a connection pooling with 52 connections, then you probably run into the 1:st, and if you use complex request that cause cascading behaviours or batching, then you typically run into the second (2). There are exceptions that match these. Do refer to the official docs above for details about that.
Now to the interesting part. There is a new section that is attempting to tells us how to maximize throughput. First of all, I think this is great. We really need this and we need Microsoft to tell us how to not only use the platform, but how to efficiently use their platform.
“Let the server tell you how much it can handle” This section is interesting as it recommends a rather complex approach to how to work with performance. As they further down recommend using threading, they essentially recommend building logic that dynamically increases and decreases the number of threads as the platform informs you that it has capacity. This brings me back to university math, and trying to figure out the derivative of an unknown function by sampling and finding the local max. I do however, think this is a rather tall order to recommend to your average developer. But it would be a great community tool, so feel free to build it. Consider the challenge set. Best would be if Microsoft included this in the SDK of course.
“Use multiple threads“ In this part they recommend using multiple threads. This is also my experience that this is a good idea as the processing time and latency per package causes certain delay on a per-message basis. By utilizing multi-threading with multiple connections, this overhead can be reduced. As there is a limit of 52 concurrent connections, I would recommend using a maximum of that amount of connections/threads per user.
Avoid batching Now this is really interesting. The previous recommendation was to use batching to be “nicer” to the API and get increased performance. The recommendation now is the direct opposite. This is based on the fact that the overhead in a WebAPI JSON-message is significantly smaller than that in a SOAP message and that this will reduce the difference between using batching and non-batching. They do, however, recommend using smaller batch sizes still. This is also my experience when working with Kingswaysoft. I typically (it depends on the instance and which table I am using) start with 16 threads with batches of 10 or 20. This has typically given me the best performance, with performance of +300 records/s.
There is also a comment about the fact that the using batching does not bypass the entitlement limits, ie. 20 000 API calls/24 hours for an enterprise user/100 000 API calls for all non-licensed users and so on. See more on the Entitlement limits based on which license you have here. Hence this calculation is done by after exploding the batch on the servers. This is also news to me as I previously was told that batching was exactly the way to go to limit the amount of calls.
Removing the affinity cookie – server multiplexing The details being eluded to in this section are very interesting. If I understand it correctly, the logic is as follows:
The point being that, shutting off the affinity cookie int the HttpClient will allow for more wider use of all the servers in the node (the entire setup of all the Frontends, backends, NLB etc.)
What I do wonder, is if it would be possible to store the Affinity Cookie, and hence pool it on the client side. As each time you need to hit a new front end you will loose some time while it warms up your instance, and it would hence be advantageous to be able to more tightly control this. Maybe even this could be another community tool for someone interested? I also think, I havn’t tested this, that you will get better results when working with removed affinity cookie if you do use batching, at least until all the frontends have been warmed up to your instance.
Do also note a very important sentence; “This increases throughput because limits are applied per server“. We do not know how many servers are used in a node frontend, but probably more than 10. Removing the affinity cookie could hence increase performance by at least one order of magnitute.
User multiplexing As all API limits are calculated on a per-user basis, another way to increase performance is to use what I like to call user multiplexing. This means that operations are done using several different application users at the same time. There is of course some admin work that needs to be done to set these up, and there is no OOB way of doing this but with SSIS and Kingswaysoft it is rather straight forward; just create several connections, one per user, configure them per user and then use the “Balanced Data Distributor” which can be found in the productivity pack to spread the data to different destinations that are using the different connections.
My tips My tips for getting good performance, for large scale datasets, are hence the following based on these new facts:
Continue to use batching, but don’t use huge batches. Probably around 5-20 will be ok.
Use multithreading. I typically use around 16, but that was before I knew about the removal of the affinity cookie. Hence I would recommend 16 per server. But I cannot tell you how many servers there are.
Use the remove affinity cookie setting, and if possible, figure out some way of pooling the affinity cookies instead.
Make sure your application can handle the exceptions regarding the API-limit and have some reasonable strategy for working with them. I have found that blasting the API for 5 minutes at max speed, then backing off for 5 minutes, then going full throttle again for 5 minutes, has given me better throughput overall than “being nice” and just finding the “right” speed to use to not be throttled. Not sure this strategy will work in the long run though.
Use application user multiplexing.
Suggestions to ETL vendors and others My suggestions to ETL vendors and others who build connections to Dataverse that require high performance are:
Start by visualizing the affinity cookie setting so that it is possible to set this as wanted.
Include multithreading, batching and application user multiplexing into the standard dataverse connections.
Figure out if there are an points to pooling the affinity cookies, and if so, include this into the connection.
Make the connection auto-optimize with the data it is currently sending. Ie. how many threads, size of batches, size of affinity cookie pool and number of application users to utilize.
Have different strategies for utilizing application users instead of just spreading the data evenly, it could be that one is used until it receives an exception an then it is put on hold for 5 minutes and then another is being used. Or a combination of these two if there are five application users, 3 might be used for data transfer, and two on hold in case one gets an exception and needs to be put on hold.
I hope this has given you some insights and that my 2 cents got you this far. Feel free to leave a comment if you have an questions!