Orphan Customer Profile lookups

There are scenarios, when CI-D is integrated with Dataverse, where you may run into a bug in the platform: one of the COLA (Contact, Opportunity, Lead or Account) table records has its Customer Profile lookup pointing at a customer profile that no longer exists. In other words an orphan link, something that normally should not exist. I will start by describing a scenario where this can occur, then what problems it can cause and how to fix it, both short term by yourself and long term by Microsoft.

When does it occur?

This problem typically occurs if you have chosen to have more contacts in Dataverse than you have in CI-D. The most common reason for this is that each 100k of unified profiles in CI-D incurs a non-trivial licensing cost per month, so skipping the least valuable customers can be a way to reduce this cost. The technical consequence is, of course, that you have to filter the customers before they are ingested into CI-D. The new filtering functionality in CI-D that is about to be released will probably help with this.

So, let’s say for the sake of an example that we only keep customers in CI-D that placed an order with us during the last 24 months. A customer that was included yesterday can then fall outside the threshold and be excluded today. Yesterday it was synchronized to Dataverse with the rehydration functionality, and the corresponding contact record was linked to the customerprofileid with the COLA backstamping functionality. Today, when the synchronization runs, this contact is no longer part of the unified profiles and hence is not synchronized to Dataverse. However, the COLA backstamping functionality does not clean up after itself properly, so the customerprofileid lookup that was set on the contact yesterday is still set, BUT as there no longer is a customer profile that corresponds to it, we have an orphan link.

What problems can it cause?

Let’s say you have an integration that is not as specific in its data processing as it should be: it reads all fields from Dataverse, changes some and then writes all fields back, instead of writing just the ones that were changed. If this happens to a record with an orphan lookup link, the platform will throw an exception when you try to save the contact, saying that the customerprofileid is incorrect/not pointing at an existing record.

How to fix it yourself

The easiest fix you can do yourself is probably to write a Power Automate flow that loops through all contact records, verifies that the customerprofileid lookup points at a customer profile that actually exists, and clears it if it does not.
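If you prefer to sketch the same logic outside of Power Automate, it could look roughly like the snippet below against the Dataverse Web API. This is only a minimal illustration, not production code: it assumes you already have an OAuth bearer token, it does not handle paging, the entity set name for the customer profile table is assumed to be the plural of the logical name, and the field names are taken from the FetchXml further down. Verify the actual navigation property name in your own environment before clearing anything.

import requests

BASE = "https://yourorg.crm.dynamics.com/api/data/v9.2"   # replace with your org URL
HEADERS = {
    "Authorization": "Bearer <access token>",             # acquired via MSAL or similar
    "Accept": "application/json",
    "OData-MaxVersion": "4.0",
    "OData-Version": "4.0",
}

# 1. All contacts that have the Customer Profile lookup set
contacts = requests.get(
    f"{BASE}/contacts"
    "?$select=contactid,fullname,_msdynci_lookupfield_customerprofile_value"
    "&$filter=_msdynci_lookupfield_customerprofile_value ne null",
    headers=HEADERS,
).json()["value"]

for contact in contacts:
    profile_id = contact["_msdynci_lookupfield_customerprofile_value"]

    # 2. Does the referenced customer profile still exist?
    resp = requests.get(
        f"{BASE}/msdynci_customerprofiles({profile_id})?$select=msdynci_customerprofileid",
        headers=HEADERS,
    )
    if resp.status_code == 404:
        # 3. Orphan link - clear the lookup by removing the reference
        requests.delete(
            f"{BASE}/contacts({contact['contactid']})/msdynci_lookupfield_customerprofile/$ref",
            headers=HEADERS,
        )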

Another fix is to rebuild the integration so that it only writes the fields it actually has changes for. This does require some extra coding, but that code can be made rather generic so it can be reused, and it will also make your integration run faster. It does not, however, actually fix the problem, it just removes the symptom.
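A rough sketch of what "only write what changed" can look like against the Web API (again just an illustration, with a made-up contact id): read the current values, compute the difference, and only PATCH the fields that actually differ. If nothing changed, no write happens at all, which both avoids touching the orphan lookup and saves API calls.

import requests

BASE = "https://yourorg.crm.dynamics.com/api/data/v9.2"   # replace with your org URL
HEADERS = {
    "Authorization": "Bearer <access token>",
    "Accept": "application/json",
    "Content-Type": "application/json",
}

def update_changed_fields(contact_id, incoming):
    """PATCH only the attributes whose values differ from what Dataverse already has."""
    current = requests.get(
        f"{BASE}/contacts({contact_id})?$select=" + ",".join(incoming),
        headers=HEADERS,
    ).json()

    delta = {name: value for name, value in incoming.items() if current.get(name) != value}
    if delta:
        requests.patch(
            f"{BASE}/contacts({contact_id})",
            json=delta,
            headers={**HEADERS, "If-Match": "*"},   # "*" = update existing record only
        )

# Example: if only lastname differs, only lastname is sent
update_changed_fields("00000000-0000-0000-0000-000000000000",
                      {"firstname": "Anna", "lastname": "Larsson"})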

For field and table reference, below is one FetchXml query that returns the contacts that have the lookup field set, and one that returns all customer profiles. I have tried combining these in a left outer join to get the records that are incorrectly linked, but there is a limitation in FetchXml that stops this.

<fetch version="1.0" output-format="xml-platform" mapping="logical" distinct="false">
  <entity name="contact">
    <attribute name="fullname" />
    <attribute name="contactid" />
    <attribute name="msdynci_lookupfield_customerprofile" />
    <filter type="and">
      <condition attribute="msdynci_lookupfield_customerprofile" operator="not-null" />
    </filter>
  </entity>
</fetch>

<fetch version="1.0" output-format="xml-platform" mapping="logical" distinct="false">
  <entity name="msdynci_customerprofile">
    <attribute name="msdynci_customerprofileid" />
  </entity>
</fetch>
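Since the two queries cannot be combined into one outer join, a workaround is to run them separately and do the "join" on the client side: load all customer profile ids into a set and keep the contacts whose lookup value is not in that set. A minimal sketch with made-up sample rows, assuming the two FetchXml results have been flattened into Python dictionaries:

# rows as returned by the two FetchXml queries above (made-up sample data)
profiles = [
    {"msdynci_customerprofileid": "11111111-1111-1111-1111-111111111111"},
]
contacts = [
    {"contactid": "c-0001", "fullname": "Anna Larsson",
     "msdynci_lookupfield_customerprofile": "11111111-1111-1111-1111-111111111111"},
    {"contactid": "c-0002", "fullname": "Bo Nilsson",
     "msdynci_lookupfield_customerprofile": "22222222-2222-2222-2222-222222222222"},
]

existing_ids = {row["msdynci_customerprofileid"] for row in profiles}

# contacts whose lookup points at a customer profile that no longer exists
orphans = [c for c in contacts
           if c["msdynci_lookupfield_customerprofile"] not in existing_ids]

for c in orphans:
    print(c["contactid"], c["fullname"], "-> dangling customer profile link")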

My suggestions to Microsoft on how they should fix it

There are two different ways to solve this from Microsoft’s side: one is on the CI-D integration side and the other on the Dataverse elastic table side.

First of all, I think this is an incorrect implementation by the CI-D team. Elastic tables currently do not respect cascade rules like “remove link”, and this should be a known fact. Hence they should implement a function that, after the backstamping is done for all existing records, cleans up any customerprofileid lookups that do not have a corresponding record.

The second alternative is for the Dataverse team to implement functionality that respects some of the cascade rules, in essence making sure the Dataverse platform cleans up any links to records that no longer exist. The upside is that this fix would help any lookup pointing at an elastic table, as this is most certainly not limited to the CI-D integration; the downside is that it would affect the performance of elastic tables, especially during deletes.

Delete records like a Ferrari

“Do you really need to delete records like a Ferrari?” – that question was posed to me a few years ago when I complained about the poor performance of the Bulk Deletion functionality in Power Platform (at that time Dynamics 365 Online) to a friend at Microsoft whom I will not name. My simple answer is yes, we do need to delete records like a Ferrari, for many reasons. I will discuss why in this article, and I have for that reason also created an Idea on the Power Apps Community site on this subject; I hope you agree with me and vote for it! You will find it via the link below.

Click here to go to the idea article and vote for faster bulk deletion

So, why is fast bulk deletion important? I would say there are several reasons, and I will walk through the ones I have thought of; if you have any others, please drop a comment.

  1. Keep your data in check – remove unnecessary data
  2. GDPR and other compliance and legal issues
  3. Power Platform growing into a citizen developer platform
  4. Entitlements effectively block using external tools

Keeping data in check

For larger organizations, especially those with many integrated modules and systems, many running Flows, workflows, Customer Voice surveys etc., the system will generate a lot of data, especially in a B2C scenario. A few of these features automatically remove their old logs, but most don’t, and we as admins and system caretakers (isn’t that a fancy title!) need to tend to this, typically by setting up jobs that clean out old data. I would recommend looking at the capacity statistics in PPAC to see which tables are the largest, and making a practice of doing so at regular intervals and downloading the numbers, so that you can see the trends over time. A suggested addition to the CoE Starter Kit would be a trend analysis of all tables, with growth numbers per week for each table, warnings for quickly growing tables and a prognosis.

However, as instances grow beyond 50-100 GB of structured data, they soon become too large to handle with bulk deletion. Some tables might still be manageable this way, but in general the performance, when I have tried to measure it (albeit a few years ago), was around 1-3 records per second. A B2C customer of mine, for whom I wanted to remove the Voice of the Customer solution, which had been used a lot, had over 50 million Survey Invites. It is not possible to remove the solution without first removing the data, and even if we put Bulk Delete on crack and it got to 10 records per second, it would still take around 2 months. I did it with SSIS/Kingswaysoft instead and it took a few days. If Bulk Delete could reach around 200 records/second, it would take a little less than 3 days.
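For reference, the back-of-the-envelope math behind those numbers:

50,000,000 records / 10 records per second ≈ 5,000,000 seconds ≈ 58 days, i.e. roughly 2 months
50,000,000 records / 200 records per second = 250,000 seconds ≈ 2.9 days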

I have also noted that when trying to Bulk Delete very large datasets, Bulk Delete simply fails; I think the underlying FetchXML query hits a SQL timeout or something similar, but I am not exactly sure what happens. As the same deletes work with Kingswaysoft, I don’t know what the difference is.

GDPR and other compliance and legal issues

As GDPR and similar compliance regulations have come into play in many countries around the world, it has become ever more important to strictly follow these detailed requirements. They might look simple from a PowerPoint, C-level perspective, but when you dig down to the detailed level where they actually need to be implemented, things are seldom as simple as in a PowerPoint.

Power Platform growing into a citizen developer platform

As the Power Platform grows from being just the platform on which Dynamics 365 is delivered to being a huge platform for digitalizing entire organizations, close to 100% of the users in an organization will start using Dataverse. The amount of data stored in Dataverse will hence grow to massive volumes, and an effective tool to manage this data becomes most important. It is probably important to such a degree that Bulk Delete cannot even scratch the tip of the iceberg of what we need to be able to do from a data management perspective: data will keep growing by leaps and bounds, and admins will need to manage not only Flows and Apps but also data, in both size and content.

Entitlements effectively block using external tools

The soon-to-be-enacted entitlements, as mentioned in my previous post, Entitlements are not throttling | Powerplatform.se, also effectively stop the use of external tools like SSIS/Kingswaysoft for deleting unwanted data. One of the customers I am working with generates between 10-20 M API requests PER DAY, and the bulk of these come from deletion jobs and other maintenance jobs trying to keep the instances in check. With the new entitlement charges there is no way this can continue, but the customer is caught between a rock and a hard place: either the data grows by leaps and bounds, or the API calls become a huge cost, and there is no easy way to handle it. What advice am I to give the customer? The most reasonable thing would be for the platform to provide the tools needed to maintain the data without incurring these costs. Whether that is Bulk Delete or some other, more elaborate feature is up to the product team, but I do think they should hold off on activating the entitlements until there is a good alternative for managing an instance’s data within the platform (without generating API requests).

What else is missing?

Bulk deletion is not only lacking in performance, it also lacks the filtering logic required for more complex queries. For some customers I have had to construct rather elaborate SSIS scripts that start with a complex FetchXML and then filter the data through several Cache Transforms, for instance against GDPR consents and similar, to arrive at the final list. I must admit that I haven’t tried using the new T-SQL endpoint for this; it would be useful if it could handle the full T-SQL complexity and if it were available in Bulk Delete or Kingswaysoft as a way to make querying more powerful.