Dinesh's Blog :::: Being Compiled ::::: DocumentDB

Wednesday, October 28, 2015

Loading data from Microsoft SQL Server to DocumentDB in Azure

In my previous post (How to create a DocumentDB in Azure), I mentioned that there are multiple ways of loading documents into DocumentDB in Azure. This post continues from there: let's see how we can load data from SQL Server into DocumentDB using the Data Migration Tool. The Data Migration Tool is a free, open-source tool that can load data into DocumentDB from many different sources. It converts the loaded data into hierarchical JSON documents and supports sources such as Microsoft SQL Server, MongoDB, JSON, CSV, Azure Table Storage and many more.

The Data Migration Tool can be downloaded from the Microsoft Download Center. However, there is an easier way: a link to the download page is available in the DocumentDB account blade.

Go to the page and download the compressed file. Once the files are extracted, you will see two executable files: dt.exe and dtui.exe.

Run dtui.exe to open the tool. As usual, a welcome screen appears, followed by a screen for selecting the source. Since we are loading data from SQL Server, select SQL under Import from. You need to add the connection string manually (you can generate it by creating a UDL file); make sure it is added without the provider.

Let's try to load data from the AdventureWorksDW2014 database. This is the query for loading customers with their locations.

USE AdventureWorksDW2014;
GO

SELECT
 c.CustomerKey AS [key]
 , c.FirstName + ' ' + c.LastName AS [Name]
 , c.AddressLine1 AS [Location.Street]
 , g.City AS [Location.City]
 , g.StateProvinceName AS [Location.Province]

FROM dbo.DimCustomer c
 INNER JOIN dbo.DimGeography g
  ON c.GeographyKey = g.GeographyKey;

Examine the query. Note how the aliases have been set, specifically for the columns related to the address. If you need a hierarchy inside the document, this is how you instruct the tool to build it. As per this query, a document will contain the properties key, Name and Location, and Location will have three more properties: Street, City and Province.

Once the query is in place, enter a dot (".") as the Nesting Separator to tell the tool that a dot separates nested properties.
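To make the effect of the nesting separator concrete, here is a minimal Python sketch of the idea. It is not the tool's own code, the function name nest_row is hypothetical, and the row values are made up; only the column names come from the query above.

```python
def nest_row(row, separator="."):
    """Turn a flat row with separated column names into a nested dict."""
    document = {}
    for column, value in row.items():
        parts = column.split(separator)
        target = document
        # Walk (or create) one nested dict per prefix, e.g. "Location".
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return document


# Hypothetical row shaped like the query output (values are invented).
row = {
    "key": 11000,
    "Name": "Jane Doe",
    "Location.Street": "1 Example Street",
    "Location.City": "Exampleville",
    "Location.Province": "Example Province",
}
document = nest_row(row)
print(document)
```

With this input, Street, City and Province end up under a single Location property, which is the shape the aliases are asking for. The sketch assumes no column is used both as a plain value and as a prefix.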

Export to
The next screen is for setting the target. There are multiple options under Export to. Since this loads only 18484 documents, we can use DocumentDB - Bulk Import rather than the other options, such as DocumentDB - Sequential record import and JSON file. Bulk Import creates a stored procedure in the collection and uses it for loading documents. The second option, DocumentDB - Sequential record import, loads documents one record at a time and is slower than bulk import, so it is better suited to small sets of documents. The third option, JSON file, does not load documents into DocumentDB but creates JSON files.

Connection string
Setting the connection string is easy. All we have to do is copy the primary connection string from Keys in the DocumentDB account blade and add the database name to it. Note that the database name has to be added manually.
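As a rough sketch of that manual step, the Python helper below appends a Database entry to a connection string. It assumes the usual AccountEndpoint=...;AccountKey=...; layout; the function name with_database, the account name and the key are placeholders, not real values.

```python
def with_database(connection_string, database):
    """Append (or replace) the Database entry of a connection string."""
    parts = [p for p in connection_string.strip().split(";") if p]
    # Drop any existing Database entry so the result has exactly one.
    parts = [p for p in parts if not p.lower().startswith("database=")]
    parts.append(f"Database={database}")
    return ";".join(parts) + ";"


# Placeholder endpoint and key, for illustration only.
primary = "AccountEndpoint=https://example.documents.azure.com:443/;AccountKey=PLACEHOLDER==;"
full = with_database(primary, "db01")
print(full)
```

Splitting on ";" is safe here because the key is base64 text, which can contain "=" but not ";".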

Collection
We can load documents into an existing collection or create a new one. If the collection we name on this screen does not exist, the tool will create it for us with the selected performance tier.

Id field
If you leave the Id input box blank, the tool will create an Id for us; otherwise we can instruct it to use one of our columns. As you see, I instruct it to use the Key column for the Id of the documents.
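The two cases can be sketched in a few lines of Python. This is an illustration only, not the tool's implementation: the function name assign_id is hypothetical, and the use of a GUID for generated ids and keeping the source column in place are assumptions.

```python
import uuid


def assign_id(document, id_field=None):
    """Return a copy of the document with an 'id' property set."""
    result = dict(document)
    if id_field:
        # Document ids are strings, so the column value is converted.
        result["id"] = str(result[id_field])
    elif "id" not in result:
        # Assumption: a GUID stands in for the generated id.
        result["id"] = str(uuid.uuid4())
    return result


with_key = assign_id({"key": 11000, "Name": "Jane Doe"}, id_field="key")
generated = assign_id({"Name": "Jane Doe"})
print(with_key["id"], generated["id"])
```

Using a business key such as CustomerKey as the Id makes reloads predictable, because the same row always maps to the same document Id.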

Indexing policy
We can add a policy that describes how we search documents and what type of index we prefer. We do not need to write it by hand; right-click the input box to get the menu and select the required policy. There are three options:

Default - Efficient indexing for numbers, supporting equality checks, range checks and sorting. Strings are indexed for equality only; range searches on strings cause a scan, and sorting on strings is not supported.
Hash - Efficient equality checks for both numbers and strings, but range searches cause a scan and sorting is not supported.
Range - Efficient indexing for both numbers and strings, including sorting, but it may use extra space and resources.
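For readers who want to see roughly what such a policy looks like as JSON, here is a hedged Python sketch that builds one per option. The property names (indexingMode, includedPaths, kind, dataType, precision) follow the DocumentDB indexing policy format as I understand it. The exact JSON the tool generates, and the precision values used here, are assumptions, so compare against what the tool shows in the input box.

```python
import json


def indexing_policy(option):
    """Return an indexing-policy dict for 'Default', 'Hash' or 'Range'."""
    kinds = {
        "Default": {"String": "Hash", "Number": "Range"},
        "Hash": {"String": "Hash", "Number": "Hash"},
        "Range": {"String": "Range", "Number": "Range"},
    }[option]
    indexes = [
        {
            "kind": kind,
            "dataType": data_type,
            # Assumed precisions: 3 for hash, -1 (maximum) for range.
            "precision": 3 if kind == "Hash" else -1,
        }
        for data_type, kind in kinds.items()
    ]
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": "/*", "indexes": indexes}],
        "excludedPaths": [],
    }


print(json.dumps(indexing_policy("Range"), indent=2))
```

The only difference between the three options in this sketch is which index kind is applied to strings and to numbers on the root path.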

Now it is ready for loading; go through the summary and complete the import.

Once the documents are loaded, we can check in the collection whether they were loaded as we instructed.

Tuesday, October 27, 2015

How to create a DocumentDB in Azure

DocumentDB, Microsoft's implementation of a NoSQL database, is a cloud-based, fully managed, massively scalable service for managing semi-structured data in Azure. Just like other NoSQL products, it supports schema-free JSON documents, and unlike other NoSQL platforms, it automatically indexes all properties in a document as soon as the document is added. Properties in added documents are easily queryable using SQL (not exactly the SQL of relational databases), and any property in the hierarchy is searchable.

This post walks through creating a DocumentDB in Azure with a simple set of steps. If you are not familiar with NoSQL, read this post: What is NoSQL Document Database? What is Azure DocumentDB?

Let's create a DocumentDB. Log in to the new Azure portal (https://portal.azure.com/) using your Azure account. Then browse for DocumentDB Accounts and click it to open the blade for DocumentDB creation.

Once it is opened, the first thing you need to do is create an account. Click Add and enter a globally unique name for the DocumentDB Account.

The DocumentDB Account is created and is available at an address of the form [account name].documents.azure.com. It still does not have a database (see image). To create a database under the account, click the Add Database button and give your database a name. As you see, I have named it db01.

Once the database is created, you should see it in the Databases grid in the DocumentDB Account blade. Every component in DocumentDB has a ResourceID, and every database needs at least one Collection.

What is a Collection?

A Collection is 10GB of storage that can be created inside the database. It is not only a unit of storage; it also determines how much you pay Microsoft. DocumentDB pricing is based on Collections, and the price changes with the Pricing Tier assigned to each collection. Billing is hourly, and scaling up or down is done by adding or removing collections.

You can open the database blade by clicking the name of the database you created, and add a Collection by clicking the Add Collection button. The Add Collection blade allows you to enter the name of the collection, select the Pricing Tier, and select the Indexing Policy.

What is Request Units per Second?

As you see, a Collection can be assigned one of three Pricing Tiers: S1, S2 or S3. The storage offered with each tier is the same, but the RUs differ. RU, or Request Units per Second, simply indicates how many requests can be performed against the Collection per second. They can be read requests or write requests, but the count is not exactly the number of requests made; it is the throughput consumed by each request, considering its usage of resources such as CPU, memory and IO operations. As I have read, a read request for a document smaller than 1KB generally requires one RU, but a write request for a document of the same size requires more than one RU because writing costs more than reading. A complex request that needs to update documents with many indexes may need more RUs. How do you determine the required RUs? You can look at the dashboards given with the blade, which show all requests made and the RUs used for those requests.
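The arithmetic can be sketched in Python as below. Everything numeric here is an assumption to replace with figures from your own dashboard: the per-tier capacities (250, 1000 and 2500 RUs per second, as I recall them from the pricing page), the cost of 5 RUs per 1KB write, and the simplification that cost grows linearly with document size. The function names are hypothetical.

```python
import math

# Assumed per-tier capacity in RUs per second; verify on the pricing page.
TIER_RUS = {"S1": 250, "S2": 1000, "S3": 2500}


def estimate_rus(reads_per_sec, writes_per_sec, doc_kb=1.0,
                 read_ru_per_kb=1.0, write_ru_per_kb=5.0):
    """Rough RUs per second; write_ru_per_kb=5.0 is only an illustrative guess."""
    size_units = max(1, math.ceil(doc_kb))
    per_second = reads_per_sec * read_ru_per_kb + writes_per_sec * write_ru_per_kb
    return per_second * size_units


def smallest_tier(required_rus):
    """Return the smallest tier that covers the load, or None if none does."""
    for tier, capacity in sorted(TIER_RUS.items(), key=lambda kv: kv[1]):
        if required_rus <= capacity:
            return tier
    return None


load = estimate_rus(reads_per_sec=100, writes_per_sec=20)
print(load, smallest_tier(load))
```

When no single tier covers the estimate, the function returns None, which corresponds to the point made earlier that scaling further means adding collections.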

Here is an image that explains the model:

The image was taken from DocumentDB hierarchical resource model and concepts. Read it for more info.

You can start the Collection with whichever tier you want. Once the tier is selected, the Indexing Policy has to be selected; there are two policies, Default and Range. Default is good for properties with numbers, and Range is best for both strings and numbers, but its storage cost is higher. For this post, let's create a Collection with S1 and the Range policy.

Now we have a database with a collection. All Collections appear in the database blade with their ResourceIDs and can be opened by clicking on them. There are many ways of connecting to the database, adding documents and querying documents:

Using the portal itself - Manually add documents one by one, or upload up to 100 documents at a time.
Using the DocumentDB migration tool.
Programmatically, using the REST API (difficult and complex) or an SDK (such as .NET; flexible and can do almost everything).
Using Stored Procedures and Triggers added to the collection.

Let's use the portal interface to add and query a document for testing. Click the Create Document button in the Collection blade (in my case the collection is Customer). This opens another blade that can be used for writing a document. As you see, it adds an Id automatically, which is required. If you do not add an Id to your document, one will be created for it. Here is my document:

Once it is added, it can be viewed using Document Explorer in the database blade, as below.

Let's see how this can be done in other ways in the next post related to DocumentDB.