Friday, 30 October 2015

Fact Table Column Types

I promised last week to continue the discussion of the various column types that are found in dimension and fact tables.

I explained the types of columns found in dimension tables and that blog can be found here: (http://mssqlcanuck.blogspot.ca/2015/10/dimension-table-column-types.html).

I also have a previous blog about the basics of dimensional modelling and it helps to explain what dimension and fact tables are. (http://mssqlcanuck.blogspot.ca/2015/08/dimensional-modeling-concepts.html)

Fact tables are really what make the dimensional model so efficient and quick for reporting and data analytics, because they contain granular measurements of important business processes.  The fact tables combined with the dimension tables create an ideal environment for fast and accurate data analysis.

I mentioned in the blog about dimension column types that it can be overwhelming to hear all the names of the data warehousing concepts, such as lineage, slowly changing dimensions, surrogate keys, natural keys, composite keys, fact-less fact tables, etc.  I’ve tried to break down these concepts into smaller blogs so they are easier to digest.

Fact Table Column Types

These are the types of columns you will find in a fact table.

Measure – This is the essence of a fact table: measures store measurements associated with a specific business process.  They are usually numeric and can be aggregated.  Examples are sales amount, order quantity and discount amount.  These values are often additive or semi-additive and, when associated with the relevant dimensions, can be rolled up to represent aggregations at various granularities.

Foreign keys – These keys create the relationships between the dimension tables and the fact table.  Without the relationship to dimension tables the measures would be meaningless.  I often use the example of 28 kg of rice: 28 by itself is ambiguous, but when you add kg (unit of measure dimension) and rice (product dimension) it becomes much more meaningful.  The fact table is on the many side of the relationship.  Together, the foreign keys in a fact table usually identify a unique row and can be considered a composite primary key.

Surrogate key – Sometimes you will include a surrogate key to act as the primary key.  Unlike a natural (business) key, which comes from the source system, a surrogate key is generated in the data warehouse, typically from an identity or sequence database object.  For performance reasons it is best if the key is short, preferably numeric, and consists of only one or two columns.  You could often just use the collection of foreign keys that point to the dimension tables as a composite primary key, but this can be confusing and difficult to work with, so sometimes it makes more sense to assign a dedicated surrogate key instead.


Lineage columns – These columns are used for auditing purposes.  Typically in a data warehouse environment the ETL loads run nightly.  You may want to keep track of which loads succeeded or failed, as well as the load date, so you know how current the data in your warehouse is, or even which source system provided the data.  This is known as lineage data; it is typically not exposed to the report consumer but is very helpful for the DBA or ETL developers.
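Putting these column types together, a minimal fact table definition might look something like this.  This is only a sketch; the FactSales table and all of its column and dimension names are hypothetical, invented for illustration.

```sql
CREATE TABLE dbo.FactSales
(
    -- Surrogate key (optional; an IDENTITY column is one common choice)
    SalesKey        INT IDENTITY(1,1) NOT NULL PRIMARY KEY,

    -- Foreign keys to the dimension tables; together they usually
    -- identify a row and could serve as a composite primary key
    DateKey         INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey      INT NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    CustomerKey     INT NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),

    -- Measures: numeric, additive values
    OrderQuantity   INT NOT NULL,
    SalesAmount     MONEY NOT NULL,
    DiscountAmount  MONEY NOT NULL,

    -- Lineage columns for auditing the ETL loads
    LoadDate        DATETIME2 NOT NULL,
    SourceSystem    NVARCHAR(50) NOT NULL,
    EtlBatchID      INT NOT NULL
);
```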

Example:  FactSalesQuota fact table

Thanks for reading, 

SQL Canuck

Friday, 23 October 2015

Identify and Resolve Orphaned SQL Server Database Users

In my experience as a SQL Server DBA, I found myself spending a lot of time making sure that our backup/restore strategy was in tip top shape.  This involved spot checking backups and testing them on test servers to ensure that they were valid.  

I'm reminded of a couple of old DBA sayings, such as "Your data is only as good as your last backup" and "Your backup is only as good as your last restore."  Testing that your backups will be usable in a disaster recovery situation is very important, and essential for job preservation.

Also, as an operational support DBA for various LOB systems such as PeopleSoft and BizTalk, we would spend some time creating copy only backups and restoring them in test and development environments so the developers have a ‘refresh’ of test or development data.  

A possible by-product of moving databases around to various environments is that sometimes it can result in an orphaned user. 

An orphaned user is a database user that has no corresponding SQL Server login.  When you create a login so that someone can be authenticated against the SQL Server instance, you will also create a database user and then associate the two.

USE [AdventureWorks2012];
GO
CREATE USER [domain\user1] FOR LOGIN [domain\login1];
GO

If a SQL Server login is dropped but the user is not, then the user is orphaned.

USE [AdventureWorks2012];
GO
DROP LOGIN [domain\login1];
GO

It is prudent to clean up these orphaned users from time to time, both to avoid confusion and for security reasons.  It can be very time consuming and brain-numbing to manually traverse the database objects in SSMS to identify and delete users.  It would also be risky, because you could incorrectly identify a user as orphaned, delete it, and possibly cause serious problems.

There is a way to detect orphaned users in a database by running a system stored procedure.  The system stored procedure is called sp_change_users_login and has an option where you provide @Action = ‘Report’ as a parameter.

USE [AdventureWorks2012];
GO
EXEC sp_change_users_login @Action = 'Report';
GO

The command will return a result set with two columns, UserName and UserSID, with one row per orphaned user.
You can also use the sp_change_users_login stored procedure, with @Action = 'Update_One', to associate an orphaned database user with a SQL Server login that you choose.
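On more recent versions of SQL Server you can also find orphans yourself by joining the database users to the server logins on their security identifiers, and then re-map a user with ALTER USER.  The following is a sketch of that approach; the user and login names are the hypothetical ones from the earlier examples.

```sql
USE [AdventureWorks2012];
GO
-- Database users with no matching server login (orphans)
SELECT dp.name AS orphaned_user, dp.type_desc
FROM sys.database_principals AS dp
LEFT JOIN sys.server_principals AS sp
    ON dp.sid = sp.sid
WHERE sp.sid IS NULL
  AND dp.type IN ('S', 'U')   -- SQL and Windows users
  AND dp.authentication_type_desc = 'INSTANCE';
GO
-- Re-map an orphaned user to an existing login
ALTER USER [domain\user1] WITH LOGIN = [domain\login1];
GO
```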

You can read more about this system stored procedure at:



Thanks for reading,  

SQL Canuck

Thursday, 22 October 2015

Dimension Table Column Types

When you start to learn about BI and data warehousing it can be very daunting to hear all the technical terms.  It is very easy to get overwhelmed when hearing about conformed dimensions, junk dimensions, fact-less fact tables, slowly changing dimensions, semantic layers, MOLAP, ROLAP and HOLAP (I think you get the idea).

It was my experience, coming from a traditional programmer/analyst background that used a relational model as the database back end, that most of these terms were unfamiliar.  The good news is that with a little effort and study it doesn’t take that long to get familiar with it all.

A good way to get used to it all is to just break it down bit by bit and digest it slowly.  I have previous posts that describe the use of dimension and fact tables as the foundation of star and snowflake schemas (http://mssqlcanuck.blogspot.ca/2015/10/the-fact-less-facttable-every-noob-to.html and http://mssqlcanuck.blogspot.ca/2015/08/dimensional-modeling-concepts.html).

I’ll use this post to briefly explain the types of columns that make up a typical dimension table.  I find that once you understand the column types (and I don’t mean data types like nvarchar, numeric or int) it is much easier to understand the bigger picture.

Dimension Table Column Types

Dimension tables really describe the facts they are related to.  For example, 16 could be a measure, but it doesn't make sense unless it is related to a product dimension (RICE) or a unit of measure dimension (KGS).

These are the types of columns you will find in a dimension table.

Key columns – You will need to be able to uniquely identify each row in a dimension table so you will need key columns. You can use natural keys from the source system or surrogate keys (http://www.kimballgroup.com/1998/05/surrogate-keys/).

Name columns – After you identify a row with a unique key, you will need to be able to refer to the entity by name.  A report that contains only keys won't read very well.  For example, a row in a customer dimension will have a key value, but you will also want a human-readable name to refer to the customer.

Attribute columns – These columns are very important for performing BI analysis.  Quite often in data analytics you will want to create pivot reports on discrete values such as gender or year.  Columns in dimension tables that hold discrete values and make good candidates for pivot tables or graphs are called attribute columns.

Member properties – It wouldn’t make sense to pivot on something like a customer’s address, because there would be far too many distinct values.  However, things like addresses are important for providing additional information on a report.  Columns like address are called member properties and are useful as labels on a report.  From a performance standpoint it makes sense to include these types of columns in the dimension table, to avoid having to use distributed queries against the source system to get the information.

Lineage columns – These types of columns are used for auditing purposes.  Typically in a data warehouse environment the ETL loads run nightly.  You may want to keep track of which loads succeeded or failed, as well as the load date, so you know how current the data in your warehouse is, or even which source system provided the data.  This is known as lineage data; it is typically not exposed to the report consumer but is very helpful for the DBA and ETL developers.
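To make these column types concrete, here is a sketch of what a customer dimension might look like.  The table and all column names are hypothetical, for illustration only.

```sql
CREATE TABLE dbo.DimCustomer
(
    -- Key columns: a warehouse surrogate key plus the natural key
    CustomerKey     INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerID      NVARCHAR(20) NOT NULL,  -- natural key from the source system

    -- Name column: a human-readable way to refer to the entity
    CustomerName    NVARCHAR(100) NOT NULL,

    -- Attribute columns: discrete values, good candidates for pivoting
    Gender          NCHAR(1) NULL,
    BirthYear       SMALLINT NULL,

    -- Member properties: labels for reports, not for pivoting
    AddressLine1    NVARCHAR(120) NULL,
    City            NVARCHAR(60) NULL,

    -- Lineage columns for auditing the ETL loads
    LoadDate        DATETIME2 NOT NULL,
    SourceSystem    NVARCHAR(50) NOT NULL
);
```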


Example: Customer dimension table 

I’ll do a similar post next week to describe the types of columns you will find on a fact table.
Thanks for reading,


SQL Canuck

Friday, 16 October 2015

UNION ALL and MERGE in SSIS

Recently, I needed to combine the results of two different queries into one existing dimension table.  Our third-party vendor changed the way they implemented a business requirement.  Our existing SSIS ETL processes had already loaded data into the data warehouse based on the ‘old’ way of doing things, but now that there was a ‘new’ way of doing the same thing, we needed to enhance our ETL to take the new way into account and load the data into the same existing dimension table.

I was able to write a SQL query that used UNION ALL to combine the old and new ways of doing things.  I could have put that query into a simple Execute SQL Task in the control flow, but it made more sense to represent the combination of the two queries graphically, so that the flow was more self-documenting and clearer to the next person who has to deal with it.
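The pure T-SQL version would have looked something like this; the table and column names here are invented for illustration, not the actual vendor tables.

```sql
-- Old and new source queries combined into one result set
SELECT OrderID, CustomerID, OrderAmount
FROM dbo.Orders_Old      -- the 'old' way of doing things

UNION ALL

SELECT OrderID, CustomerID, OrderAmount
FROM dbo.Orders_New;     -- the 'new' way of doing things
```

Note that UNION ALL requires both queries to return the same number of columns with compatible data types, which is exactly the metadata requirement the SSIS transformations impose as well.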

I would put the old query in one OLE DB Source and the new query in another OLE DB Source, and then combine them into one destination.  It was easy enough to create the top two sources, but when it came time to create the transformation that combines them, I realized there were two options.  I could use the MERGE transformation or the UNION ALL transformation.

Example:  Union All Transformation

Example:  Merge Transformation



I have worked as a data warehouse architect for several years and have had the opportunity to use the MERGE transformation before, but I had not worked with UNION ALL in a while, and I needed to remind myself of the difference between the two, since they appear so similar.

After some research, it is really quite simple. 

There are only three main differences between them:
1. Union All outputs unsorted data, so it does not require sorted inputs.
2. Merge requires sorted inputs and outputs sorted data.
3. Merge can only use two inputs; Union All can accept multiple inputs.


It is quite obvious from the list above that the most significant difference is that Merge can only use two inputs.  If you have more than two inputs to combine, then you must use Union All.

Union All is nicer to work with because you don’t have to mess around with ensuring the inputs are sorted, which can require a Sort transformation or an ORDER BY clause in your SQL query.

Merge can be very fussy about how you set advanced properties such as IsSorted and SortKeyPosition.
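If you do choose Merge, one way to satisfy its sorted-input requirement is to sort in the source queries themselves rather than adding a Sort transformation, and then mark each source's output accordingly.  A sketch, using the same hypothetical table names as before:

```sql
-- Each OLE DB Source query sorts on the merge key.
-- In the Advanced Editor, set IsSorted = True on the source's output
-- and SortKeyPosition = 1 on the OrderID output column.
SELECT OrderID, CustomerID, OrderAmount
FROM dbo.Orders_Old
ORDER BY OrderID;
```

This pushes the sorting down to the database engine, which is usually cheaper than the fully blocking Sort transformation in the data flow.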

An important similarity between the two transformations is that both require their inputs and outputs to have matching metadata, such as compatible data types.

In the end the results are identical, so take care to know the differences and choose carefully depending on the nature of your goal.


Thanks for reading,   SQL Canuck

Monday, 5 October 2015

The ‘Fact-less’ Fact Table

Every ‘noob’ to data warehousing is aware of the simple concepts of classic star schema model.  

I have a previous blog that describes it, found here:


In some of my recent research for other blog entries I came across a term that I have read about in the past but haven’t seen in a while.  I think terms like this come up when we as data professionals study the true concepts and theory of a subject, but in our day-to-day work they don’t come up as often.  I’m speaking of the term ‘fact-less fact’ table.

Just as a review, we know the fact table as the table in the middle of the star schema that contains the measurable numeric fields as well as a series of foreign keys that point to the qualifying dimension tables.  This allows us to roll up numeric additive values and relate them to other entities that give the aggregation a context that makes sense.  For example:  ‘28’ by itself doesn’t mean much but when you say ’28 kg of rice’ it makes a lot more sense.  You are applying the dimensions of ‘unit of measure’ (kg) and ‘product’ (rice) to the fact field aggregate of ‘28’.

So what does all this have to do with a fact-less fact table?  

A fact-less fact table is really just an intersection of dimensions.  It contains none of the familiar additive numeric fields we are accustomed to seeing in a fact table.  It consists only of foreign keys that point to dimension tables, and it represents a many-to-many relationship between dimensions.

The classic example is that of a student information system that tracks attendance.  Imagine a fact table called Student_Attendance and dimension tables called Date, Student, Location, Teacher and Class.  



The event of a student attending a class doesn’t really need a recorded numeric representation in the fact table.  The mere presence of a row, containing foreign keys to the dimension tables, means that the student attended that particular class on that particular day at that particular location.  It just isn’t necessary to record a ‘1’ to represent the event; a numeric additive field would only increase the size of the fact table for no good reason.
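A sketch of such a table, using the student attendance example (the table and dimension names are illustrative):

```sql
CREATE TABLE dbo.Student_Attendance
(
    -- Nothing but foreign keys: the row itself records the event
    DateKey      INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    StudentKey   INT NOT NULL REFERENCES dbo.DimStudent (StudentKey),
    LocationKey  INT NOT NULL REFERENCES dbo.DimLocation (LocationKey),
    TeacherKey   INT NOT NULL REFERENCES dbo.DimTeacher (TeacherKey),
    ClassKey     INT NOT NULL REFERENCES dbo.DimClass (ClassKey),

    -- A subset of the foreign keys can still serve as the primary key
    CONSTRAINT PK_Student_Attendance
        PRIMARY KEY (DateKey, StudentKey, ClassKey)
);
```

Counting attendances is then just a COUNT(*) over the rows, grouped by whichever dimensions you care about.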

Thanks for reading, 


SQL Canuck