So you want to know more about the performance of your employees and your customers and run statistics on sensitive information. Well, if you or your HR Team aren’t data scientists, you got a problem.
But when you create your database definition like the following, your engineers can create reports and analyses, without being exposed to the actual sensitive data.
Using the SQL Server feature of masked columns, combined with clever rights and access management lets users work with the data but not read the actual data.
Using SQL Server Database Projects and Azure DevOps lets you take security to another level, by working even without sensitive data in the dev environment and running the analysis on private databases.
The basic architecture of a simple data layer
Let’s assume you have two systems containing the sensitive data you want to run your reporting on, one in a SaaS and one in a DB in your network. You can use Azure Data Factory and Azure VNets to securely ingest your data into the database.
If you want to know more about, how to create the infrastructure using Terraform, let me know in the comments.
Create an Azure SQL Database project
In this article, the focus is on database definition. We’ll use Azure Data Studio to set up a Database definition project. If you want to know more about how to set up a project, have a look at SQL Server Dev Ops Style
In our Database Project, we add a Folder Schema
with three scripts inside.
- Schema Definition for our SystemA
- Schema Definition for our SystemB
- Schema Definition for our Data Integration Layer
Next, we add a Folder for every Schema and create a Table in the SystemA Schema and a table in the SystemB Schema. These are the tables that have masked columns because they will contain sensitive data. A masked column is basically a column where a query will not return the actual value of the field but be replaced with dummy values or only partially visible values.
CREATE TABLE [StageA].[Employee] (
[EmployeeId] INT NOT NULL PRIMARY KEY
, [sex] NVARCHAR(6) MASKED WITH (FUNCTION = 'partial(0,"x",2)')
, [FirstName] NVARCHAR(50) MASKED WITH (FUNCTION = 'default()')
, [LastName] NVARCHAR(50) MASKED WITH (FUNCTION = 'default()')
, CONSTRAINT chk_sex CHECK ([sex] IN ('female', 'male', 'divers'))
CREATE TABLE [StageB].[Salaries]
[Id] INT NOT NULL PRIMARY KEY
,[EmployeeID] INT NOT NULL
,[Period] NVARCHAR(6) NOT NULL
, [Salary] NUMERIC(15,4) NOT NULL DEFAULT(0.0)
Types of masks
There are currently 5 types of masks available. The default mask and the email mask will replace the actual value with an ‘xxx’ string and leave just a portion of the value visible.
The random mask replaces any numeric values (Int, BigInt, Numeric, etc.) with a random number in the given range.
The DateTime mask can be used to mask date type values, or only parts of the date-time value, like the year, day, or hour.
Last but not least there is the custom mask, which allows you to replace a value with a user-defined string, leaving a leading and trailing number of characters visible.
Finally, in our database project, we add a procedure to integrate the data and compute average salaries per gender.
Roles to use
Now a db_owner, which you are, if you created the SQL Database has permission to unmask data, but we want our database model developers to explicitly not have the permission to unmask data. Hence, they can’t be part of the db_owner role.
Let’s assume we have a developer John Doe, with the AAD principle email@example.com the following will give John enough power on the DB to develop reporting solution, while not being able to unmask the data.
CREATE USER [firstname.lastname@example.org] FROM EXTERNAL PROVIDER;
ALTER ROLE [db_ddladmin] ADD MEMEBER [email@example.com];
ALTER ROLE [db_datawriter] ADD MEMEBER [firstname.lastname@example.org];
ALTER ROLE [db_datareader] ADD MEMEBER [email@example.com];
GRANT CONNECT TO [firstname.lastname@example.org];
GRANT VIEW DEFINITION TO [email@example.com];
GRANT EXECUTE ON SCHEMA::[Intergration] TO [firstname.lastname@example.org];
Be aware that data masking is not a severe security measure, but rather a convenient function for your developers. For example, it is still possible to get the salary of users by querying for a very small range
SELECT * FROM StageB.Salaries WHERE salary = 100000
Also, every user with the UNMASK permission or the CONTROL permission, like members of the role db_owner, is able to unmask data.
Please also be aware that masking data is not possible with encrypted columns.
Suppose you really want to make sure that your developers are not able to hack their way toward the information. In that case, you should consider having a development database, with dummy data, and a production database with the actual data.
Also consider using row-level security, to make only certain parts of data available to users allowed to receive the data. If you like to know more about row-level security, let me know in the comments.
Using the data masking functionality of SQL Server is pretty convenient for data scientists and engineers, to work with the actual data, without being able to see details of the actual data.
Nevertheless, data masking is not a real security feature to prevent data leakage, since there are many ways to circumvent the masking and get the actual data.
If you need a full example project, please find it on GitHub.
If this was helpful, some claps and a subscribe would be nice.