Reducing the size of Sitecore Master DB

15 March 2022

When it comes to Sitecore development, an issue every developer has likely experienced is the size of the databases in relation to the size of their hard disk. In an ideal world, production DBs would contain production data, test environments would have data ideally suited for testing, and developer workstations would have the minimum required to develop the solution.

In reality though, I think most people have the experience of everything being a copy from production. This ends up being the case because of a client's requirement that UAT needs to look the same as prod, or because QA needs prod content to replicate a bug. And although critical Sitecore items may have been serialized, not having any content in your local makes it a bit hard to dev.

When a website is new and doesn't have much content this isn't a huge issue, but when you inherit one that is 5 years old with a 25 GB DB, things start to become a problem. Not only is the hard disk space required an issue, but just setting a new developer up takes hours of download time.

After getting a new laptop and being faced with the challenge of needing to copy multiple DBs (and not even having enough space to back them up on my existing machine), I decided to finally do something about reducing their size.

Removing old item versions

Having a history of item versions is a great feature for content editors; however, as a dev I don't really need them on my local. They also hold references to media items that aren't used any more.

This Sitecore PowerShell script from Webbson removes those old versions and even lets you specify how many versions you want to keep. I went for 1.

<#
  This script will remove old versions of items in all languages so that the items only contains a selected number of versions.
#>

$item = Get-Item -Path "master:\content"
$dialogProps = @{
    Parameters = @(
        @{ Name = "item"; Title="Branch to analyse"; Root="/sitecore/content/Home"},
        @{ Name = "count"; Value=10; Title="Max number of versions";  Editor="number"},
        @{ Name = "remove"; Value=$False; Title="Do you wish to remove items?"; Editor="check"}
    )
    Title = "Limit item version count"
    Description = "Sitecore recommends keeping 10 or fewer versions on any item, but policy may dictate this to be a higher number."
    Width = 500
    Height = 280
    OkButtonName = "Proceed"
    CancelButtonName = "Abort"
}

$result = Read-Variable @dialogProps

if($result -ne "ok") {
    Close-Window
    Exit
}

$items = @()
Get-Item -Path master: -ID $item.ID -Language * | ForEach-Object { $items += @($_) + @(($_.Axes.GetDescendants())) | Where-Object { $_.Versions.Count -gt $count } | Initialize-Item }
$ritems = @()
$items | ForEach-Object {
    $webVersion = Get-Item -Path web: -ID $_.ID -Language $_.Language
    if ($webVersion) {
        $minVersion = $webVersion.Version.Number - $count
        $ritems += Get-Item -Path master: -ID $_.ID -Language $_.Language -Version * | Where-Object { $_.Version.Number -le $minVersion }
    }
}
if ($remove) {
    $toRemove = $ritems.Count
    $ritems | ForEach-Object {
        $_ | Remove-ItemVersion
    }
    Show-Alert "Removed $toRemove versions"
} else {
    $reportProps = @{
        Property = @(
            "DisplayName",
            @{Name="Version"; Expression={$_.Version}},
            @{Name="Path"; Expression={$_.ItemPath}},
            @{Name="Language"; Expression={$_.Language}}
        )
        Title = "Versions proposed to remove"
        InfoTitle = "Sitecore recommendation: Limit the number of versions of any item to the fewest possible."
        InfoDescription = "The report shows all items that have more than <b>$count versions</b>."
    }
    $ritems | Show-ListView @reportProps
}

Close-Window
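To make the selection rule in the script above easier to follow, here is a small Python sketch of the arithmetic as I read it: versions are only removed when their number is at or below the published (web) version number minus the chosen count. The function name and the sample version numbers are made up for illustration; this is not part of the Sitecore script.

```python
def versions_to_remove(version_numbers, published_version, keep_count):
    """Return the version numbers the rule above would select for removal.

    version_numbers   -- all version numbers of one item in one language
    published_version -- version number currently in the web DB
    keep_count        -- the "Max number of versions" value from the dialog
    """
    # Items with no more versions than the limit are left alone.
    if len(version_numbers) <= keep_count:
        return []
    # Mirrors: $minVersion = $webVersion.Version.Number - $count
    min_version = published_version - keep_count
    return sorted(v for v in version_numbers if v <= min_version)


# Hypothetical item: six versions, version 5 is published, keep 1.
removed = versions_to_remove([1, 2, 3, 4, 5, 6], published_version=5, keep_count=1)
print(removed)
```

In this example versions 1 to 4 are selected, while the published version 5 and the newer draft 6 are kept. Anything newer than the published version is never touched by this rule.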

Removing unpublished items

After a few failed attempts at reducing the size of the DBs, I discovered that the content editors working on the website had seemingly never deleted any content. Instead they had just marked things as unpublishable. I can see the logic in this, but after 5+ years, they have a lot of unpublished content filling up the content tree.

Well, if it's unpublished I probably don't need it on my local machine, so let's delete it.

Here's a script I wrote. The first part removes items set to never publish. After running just this part I found that lots of the content items had the item set to publish but the version set to hidden. The second part therefore loops through the versions on each item and removes any version set to hidden. If the item has no versions left, it is removed too.

# Remove items set to never publish
Get-ChildItem -Path "master:\sitecore\content" -Recurse |
    Where-Object { $_."__Never publish" -eq "1" } | Remove-Item -Recurse -Force -Confirm:$false

# Loop through items and remove versions set to never publish, then remove the item if it has no versions left
foreach ($item in Get-ChildItem -Path "master:\sitecore\content" -Recurse) {

    # Output the item so progress is visible while the script runs
    $item
    foreach ($version in $item.Versions.GetVersions($true)) {
        # Output the version and its "Hide version" flag for reference
        $version
        $version."__Hide version"
        if ($version."__Hide version" -eq "1") {
            $version | Remove-ItemVersion -Recurse -Confirm:$false
        }
    }

    if ($item.Versions.GetVersions($true).Count -eq 0) {
        $item | Remove-Item -Recurse -Force -Confirm:$false
    }
}

Remove dead links

In the next step I rebuild the links DB, but I kept ending up with entries in the link table with target items that didn't exist. After a bit of searching I came across an admin page for clearing up dead links.

/sitecore/admin/RemoveBrokenLinks.aspx

With this page you can remove all those pesky dead links caused by editors deleting items and leaving the links behind.

Clean Up DBs

With our content reduced, the DBs now need a clean-up before we do anything else.

In the admin section there is a DB Cleanup page that will let you perform various tasks on the DB. I suggest doing all of these.

/sitecore/admin/DBCleanup.aspx

Once this is done navigate to the control panel and rebuild the link database. From the control panel you can also run the clean up database script, but it won't give you as much feedback.

/sitecore/client/Applications/ControlPanel.aspx?sc_bw=1

Remove unused media

With all the old versions, items and dead links removed and the DBs cleaned up, it's time to get rid of any unused media items. If you have a large DB, it's likely that most of the space is taken up by media items. Fortunately, with another PowerShell script we can remove any media that isn't linked to.

This PowerShell script is an adapted version of one by Michael West. You can find his version here https://michaellwest.blogspot.com/2014/10/sitecore-powershell-extensions-tip.html?_sm_au_=iVVB4RsPtStf5MfN

The main difference is I've been more aggressive and removed the checks on item owner and age.

filter Skip-MissingReference {
    $linkDb = [Sitecore.Globals]::LinkDatabase
    if($linkDb.GetReferrerCount($_) -eq 0) {
        $_
    }
}

$items = Get-ChildItem -Path "master:\sitecore\media library" -Recurse |
    Where-Object { $_.TemplateID -ne [Sitecore.TemplateIDs]::MediaFolder } |
    Skip-MissingReference

if($items) {
    Write-Log "Removing $($items.Length) item(s)."
    $items | Remove-Item
}

Shrink databases

Lastly, through SQL Server Management Studio, shrink your database and files to recover the unused space you hopefully now have from removing all of that media.

In my case I was able to turn a 20+ GB database into a 7 GB database by doing these steps.

If your local is running with both web and master DBs, you should now do a full publish. The item versions which are published should stay exactly the same, as we only removed items and versions set to not publish. You should however get a reduction in your web DB from the media items being removed.

Deploying a SQL DB with Azure Pipelines

12 April 2021

Normally when I work with SQL Azure I handle DB schema changes with Entity Framework migrations. However, if you're using Azure Functions rather than Web Jobs it seems there are a number of issues with this, and I could not find a decent guide which resulted in a working solution.

Migrations aren't the only way to release a DB change though. SQL Server Database projects have existed for a long time and are a perfectly good way of automating a DB change. My preference for EF Migrations really comes from not wanting to have an EF model and a separate table schema when they're essentially a duplicate of each other.

Trying to find out how to deploy this through Azure DevOps Pipelines, however, was far harder than I expected (my expectation was about 5 mins). A lot of guides weren't very good and virtually all of them start with Click new pipeline, then select Use the classic editor. WAIT Classic Editor on an article written 3 months ago!?!?! Excuse me while I search for a solution slightly more up to date.

Creating a dacpac file

At a high level, the solution is to have a SQL Server Database project and use an Azure Pipeline to compile that to a dacpac file, then use a release pipeline to deploy that to the SQL Azure DB.

I'm not going to go into any detail about how you create a SQL Server Database project, as it's relatively straightforward. The one thing to be aware of is that the project needs to have a target platform of Microsoft Azure SQL Database, otherwise you'll get a compatibility error when you try to deploy.

Building a SQL Server Database project in Azure DevOps

To build a dacpac file, create a new pipeline in Azure DevOps (the yaml kind), select your repo and get yourself a blank configuration file. Also at this point make sure your code is actually in the repo!

The configuration I used looks like this; I've included notes in the code to explain what's going on.

# The branch you want to trigger a build
trigger:
- master

pool:
  vmImage: "windows-latest"

variables:
  configuration: release
  platform: "any cpu"
  solutionPath: # Add the path to your Visual Studio solution file here

steps:
  # Doing a Visual Studio build of your solution will trigger the dacpac file to be created.
  # If you have more projects in your solution (which you probably will) you may get an error here
  # as we haven't restored any nuget packages etc. For just a SQL DB project, this should work
  - task: VSBuild@1
    displayName: Build solution
    inputs:
      solution: $(solutionPath)
      platform: $(platform)
      configuration: $(configuration)
      clean: true

  # When the dacpac is built it will be in the project's bin/configuration folder.
  # To get it into an artifact (probably with some other things you want to publish like an Azure function)
  # we need to move it somewhere else. This will move it to a folder called drop
  - task: CopyFiles@2
    displayName: Copy DACPAC
    inputs:
      SourceFolder: "$(Build.SourcesDirectory)/MyProject.Database/bin/$(configuration)"
      Contents: "*.dacpac"
      TargetFolder: "$(Build.ArtifactStagingDirectory)/drop"

  # Publish the contents of the drop folder into an artifact
  - task: PublishBuildArtifacts@1
    displayName: "Publish artifact"
    inputs:
      PathtoPublish: "$(Build.ArtifactStagingDirectory)/drop"
      ArtifactName: # Artifact name goes here
      publishLocation: container

Releasing to SQL Azure

Once the pipeline has run you should have an artifact coming out of it that contains the dacpac file.

To deploy the dacpac to SQL Azure you need to create a release pipeline. You can do this within the build pipeline, but personally I think builds and releases are different things and should therefore be kept separate, particularly as releases should be promoted through environments.

Go to the releases section in Azure DevOps and click New and then New release pipeline.

There is no template for this kind of release, so choose Empty job on the next screen that appears.

On the left you will be able to select the artifact getting built from your pipeline.

Then from the Tasks drop down select Stage 1. Stages can represent the different environments your build will be deployed to, so you may want to rename this something like Dev or Production.

On Agent Job click the plus button to add a task to the agent job. Search for dacpac and click the Add button on Azure SQL Database deployment.

Complete the fields to configure which DB it will be deployed to (as shown in the picture but with your details).

And that's it. You can now run the pipelines and your SQL Project will be deployed to SQL Azure.

Some other tips

On the Azure SQL Database deployment task there is a property called Additional SqlPackage.exe Arguments. This can be used to specify things like whether loss of data should be allowed. You can find the list of these at this url https://docs.microsoft.com/en-us/sql/tools/sqlpackage/sqlpackage?view=sql-server-ver15#properties

If you are deploying to multiple environments you will want to use variables for the server details rather than having them on the actual task. This will make it easier to clone the stages and have all connection details configured in one place.

Data Factory: How to upsert a record in SQL

9 March 2021

When importing data to a database we want to do one of three things: insert the record if it doesn't already exist, update the record if it does, or potentially delete the record.

For the first two, if you're writing a stored procedure this can often lead to a bit of SQL that looks something like this:

IF EXISTS(SELECT 1 FROM DestinationTable WHERE Foo = @keyValue)
BEGIN
  UPDATE DestinationTable
  SET Baa = @otherValue
  WHERE Foo = @keyValue
END
ELSE
BEGIN
  INSERT INTO DestinationTable(Foo, Baa)
  VALUES (@keyValue, @otherValue)
END

Essentially, it's an IF statement to see if the record exists based on some matching criteria.

Data Factory - Mapping Data Flows

With a mapping data flow, data is inserted into a SQL DB using a Sink. The Sink lets you specify a dataset (which will specify the table to write to), along with mapping options to map the stream data to the destination fields. However, the decision on whether a row is an Insert/Update/Delete must already be specified!

Let's use an example of some data containing a person's First Name, Last Name and Age. Here's the table in my DB:

And here's a CSV I have to import:

FirstName,LastName,Age
John,Doe,10
Jane,Doe,25
James,Doe,50

As you can see, in my import data Jane's age has changed, there's a new entry for James, and Janet doesn't exist (but I do want to keep her in the DB). There are also no IDs in my source data, as that's an identity created by SQL.

If I look at the Data preview on my source in the Data Flow, I can see the 3 rows from my CSV, but notice there is also a little green plus symbol next to each one.

This means that they are currently being treated as Inserts, which is true for one of them but not for the others. If we were to connect this to the sink it would result in 3 new records being added to the DB, rather than two being updated.

To change the Insert to an Update you need an Alter Row step. This allows us to define rules to state what should be an insert and what should be an update.

However to know if something should be an insert or an update requires knowledge of what is in the DB. To do that would mean a second source, followed by a join on First Name/Last Name and then conditions based on which rows have an ID from the DB or not. This all seems a bit needlessly complicated, and it is.
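To show what that join-based approach boils down to, here is a minimal Python sketch of the logic (not Data Factory code). The existing keys and their IDs are assumptions made up for illustration, including Janet's surname, since the real table contents only appear in the screenshot.

```python
def classify_rows(source_rows, existing_ids):
    """Tag each source row as 'update' if its key is already in the DB, else 'insert'."""
    tagged = []
    for row in source_rows:
        key = (row["FirstName"], row["LastName"])
        # A row that joins to an existing ID is an update, otherwise an insert.
        action = "update" if key in existing_ids else "insert"
        tagged.append((action, row["FirstName"]))
    return tagged


# Hypothetical result of the "second source": key -> identity column value.
existing_ids = {("John", "Doe"): 1, ("Jane", "Doe"): 2, ("Janet", "Doe"): 3}

# The three rows from the CSV.
source_rows = [
    {"FirstName": "John", "LastName": "Doe", "Age": 10},
    {"FirstName": "Jane", "LastName": "Doe", "Age": 25},
    {"FirstName": "James", "LastName": "Doe", "Age": 50},
]

actions = classify_rows(source_rows, existing_ids)
print(actions)
```

John and Jane come out as updates and James as an insert, which is exactly the decision an Alter Row step would otherwise have to encode by hand.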

Upsert

When using a SQL sink there is a 4th option for what kind of method should be used and that is an Upsert. An upsert will result in a SQL merge being used. SQL Merges take a set of source data, compare it to the data already in the table based on some matching keys and then decide to either update or insert new records based on the result.
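The effect of an upsert on the example data can be sketched outside Data Factory. The snippet below is only an illustration: it uses Python with an in-memory SQLite database and SQLite's INSERT ... ON CONFLICT DO UPDATE rather than a SQL Server MERGE. The table name People, the starting ages for Jane and Janet, and Janet's surname are assumptions, as the real table is only shown in the screenshot.

```python
import csv
import io
import sqlite3

CSV_DATA = """FirstName,LastName,Age
John,Doe,10
Jane,Doe,25
James,Doe,50
"""


def upsert_people(conn, rows):
    """Insert each row, or update Age when (FirstName, LastName) already exists."""
    conn.executemany(
        """
        INSERT INTO People (FirstName, LastName, Age)
        VALUES (:FirstName, :LastName, :Age)
        ON CONFLICT (FirstName, LastName) DO UPDATE SET Age = excluded.Age
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE People (
        Id INTEGER PRIMARY KEY AUTOINCREMENT,
        FirstName TEXT NOT NULL,
        LastName TEXT NOT NULL,
        Age INTEGER NOT NULL,
        UNIQUE (FirstName, LastName)  -- plays the role of the key columns
    )
    """
)
# Assumed starting data: Jane's and Janet's ages are invented for the example.
conn.executemany(
    "INSERT INTO People (FirstName, LastName, Age) VALUES (?, ?, ?)",
    [("John", "Doe", 10), ("Jane", "Doe", 24), ("Janet", "Doe", 40)],
)

rows = [
    {**row, "Age": int(row["Age"])}
    for row in csv.DictReader(io.StringIO(CSV_DATA))
]
upsert_people(conn, rows)

people = conn.execute("SELECT FirstName, Age FROM People ORDER BY Id").fetchall()
print(people)
```

After the upsert, Jane's age is updated, James is added as a new row, and John and Janet are left as they were, because nothing in an upsert deletes rows that are missing from the source.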

On the sink's Settings tab, untick Allow insert and tick Allow upsert. When you tick Allow upsert, a Key columns property will appear, which is where you specify which columns should be used as a key. For me this is FirstName and LastName.

If you don't already have an Alter Row step it will warn you that this is missing.

Even though we are only doing what equates to a SQL merge, you still need to alter the rows to say they should be an upsert rather than an insert.

As we are upserting everything our condition can just be set to return true rather than analysing any row data.

And there we have it, all rows will be treated as an upsert. If we look at the Data preview we can now see the upsert icon on each row.

And if we look at the table after running the pipeline, we can see that Jane's age has been updated, James has been added, and John and Janet have stayed the same.