Showing posts with label Bulk Operations.

Friday, August 4, 2017

bcp error - Invalid column type from bcp client for colid 1 - Solution

You might experience the following error with bcp when loading text data into a SQL Server table;


You typically see this error when there is a data type mismatch between a column in the destination table and a field coming from the file. However, it can also appear when you have used a view for the bcp operation and the underlying table has since been changed.

In my case, the destination table has an identity column, so I created a view for loading data via bcp. I started seeing this error after making some modifications to the table. Once the view was dropped and recreated, bcp worked without any issue.
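As a rough sketch of that workaround, assume a hypothetical dbo.Sales table with an identity column and a dbo.Sales_Load view used as the bcp target; after the table change, the view is dropped and recreated so its metadata matches the current table, and the bcp command points at the view (server, database, file, and column names are placeholders):

-- Recreate the view so its metadata matches the modified table
DROP VIEW dbo.Sales_Load;
GO
CREATE VIEW dbo.Sales_Load
AS
 SELECT SaleDate, ProductCode, Amount -- identity column excluded
 FROM dbo.Sales;
GO

bcp MyDatabase.dbo.Sales_Load in "C:\Data\sales.txt" -S MyServer -T -c -t "," -r "\n"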

Monday, July 31, 2017

SQL Server bcp error - String data, right truncation

We generally expect files loaded using bcp to have a proper row terminator, and with all switches accurately defined, data should load without any issue. But sometimes you may see errors like the one below even if you try switches such as -r "\r\n" or -r "\n".


In most cases, the reason is the row terminator. If you open the file to be loaded with something like Notepad++ and enable Show All Characters (the icon is on the toolbar), you can see the exact row terminator.


If you see something like the above, then you need to specify the row terminator as a hexadecimal value. In this case, adding -r "0x0a" (the hexadecimal value of the LF character) makes the data load without any issue.
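For example, a complete command could look like the one below; the server, database, table, and file names are placeholders used only for illustration.

bcp MyDatabase.dbo.MyTable in "C:\Data\datafile.txt" -S MyServer -T -c -t "," -r "0x0a"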

Sunday, August 16, 2015

Differences between BCP and BULK INSERT

It is not uncommon to import data from and export data to other sources, as not all data can be entered row by row. Generally, both import and export can be considered ETL operations; some involve transformation (T) and some do not. If the requirement is just loading bulk data into SQL Server without transformation, the easiest and fastest ways are bcp and BULK INSERT. Which is best? Here is a comparison.

Type of utility
  • bcp: Command-line utility
  • BULK INSERT: T-SQL statement

Usage
  • bcp: Both import and export
  • BULK INSERT: Import only

Required T-SQL knowledge
  • bcp: Not required, unless queryout is used to write a query instead of naming a table
  • BULK INSERT: Required

Constraint enforcement
  • bcp: Does not check CHECK and FOREIGN KEY constraints unless CHECK_CONSTRAINTS is specified with the -h hint. UNIQUE, PRIMARY KEY, and NOT NULL constraints are always enforced.
  • BULK INSERT: Does not check CHECK and FOREIGN KEY constraints unless CHECK_CONSTRAINTS is specified. UNIQUE and PRIMARY KEY constraints are always enforced. Inserts blanks into NOT NULL character-type columns if no value is passed for the column.

Explicit transaction
  • bcp: Not possible
  • BULK INSERT: Possible

Format file
  • bcp: Can generate and use one
  • BULK INSERT: Cannot generate one, but can use one

Source location
  • bcp: Can refer to a location in the client environment
  • BULK INSERT: Can refer only to a location accessible from the server

Performance
  • bcp: Fast
  • BULK INSERT: Fastest

Protocol
  • bcp: TCP/IP, Shared Memory, Named Pipes
  • BULK INSERT: In memory (runs within the SQL Server process)
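To see the difference in usage, here is a rough sketch of loading the same file with each method; the server, database, table, and file paths are hypothetical.

bcp MyDatabase.dbo.SalesStaging in "C:\Data\sales.txt" -S MyServer -T -c -t "," -r "\n"

-- The equivalent T-SQL statement; the file path must be accessible from the server
BULK INSERT dbo.SalesStaging
FROM 'D:\Data\sales.txt'
WITH
(
 FIELDTERMINATOR = ','
 , ROWTERMINATOR = '\n'
 , CHECK_CONSTRAINTS -- enforce CHECK and FOREIGN KEY constraints during the load
);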




Friday, August 14, 2015

Can I disable all constraints when loading data?

Generally, when loading a large data set, we disable constraints and drop indexes to speed up the loading process. Constraints help us enforce integrity rules, making sure that no inconsistent data exists, though they slow down update operations on the table. There are different types of constraints for addressing different types of integrity, such as;
  • Entity integrity - refers to row-level integrity
  • Domain integrity - refers to column-level integrity
  • Referential integrity - refers to integrity across relationships
In addition to that, constraints can be used for addressing user-defined integrity too.

SQL Server supports the following constraint types;
  • PRIMARY KEY constraint
  • UNIQUE KEY constraint
  • FOREIGN KEY constraint
  • CHECK constraint
  • NOT NULL constraint
  • DEFAULT constraint
However, note that we cannot disable all constraints. Only CHECK and FOREIGN KEY constraints can be disabled. Other constraints cannot be disabled as such; they have to be modified or dropped. For example, if you need to disable a PRIMARY KEY or UNIQUE KEY constraint, the index attached to it has to be dropped.

Here is the code for disabling a CHECK constraint.

-- Create a table with a check constraint
USE tempdb;
GO

CREATE TABLE dbo.Customer
(
 CustomerId int not null
 , CustomerName varchar(10) not null
 , CreditAmount decimal(18,2) not null
 , CONSTRAINT Customer_SalaryCheck CHECK (CreditAmount < 100000)
);
GO

-- Inserting valid data
INSERT INTO dbo.Customer
 VALUES (1, 'Joe Brown', 65000);
INSERT INTO dbo.Customer
 VALUES (2, 'Mary Smith', 75000);


-- Inserting invalid data
-- This will not be possible as it violates the constraint
INSERT INTO dbo.Customer
 VALUES (3, 'Joe Brown', 110000);

-- Disabling the constraint and inserting the same again
-- Now it allows to insert a row with an invalid value
ALTER TABLE dbo.Customer NOCHECK CONSTRAINT Customer_SalaryCheck;
GO

INSERT INTO dbo.Customer
 VALUES (3, 'Joe Brown', 110000);
GO

-- Enabling the constraint.
-- Note that existing records are not validated
ALTER TABLE dbo.Customer CHECK CONSTRAINT Customer_SalaryCheck;
GO
-- Now the table contains invalid data.

Note that records inserted while the constraint was disabled are not validated when it is enabled again. If you want to check all records for integrity when enabling, use the following code.

-- If you want to make sure no invalid data exists
-- and validate all records against the constraint,
-- disable it again and enable it with the WITH CHECK option
ALTER TABLE dbo.Customer NOCHECK CONSTRAINT Customer_SalaryCheck;
GO

ALTER TABLE dbo.Customer WITH CHECK CHECK CONSTRAINT Customer_SalaryCheck;
GO
-- This will throw an error because the current rowset has an invalid value.
-- This can be corrected by either updating or deleting the row.
DELETE FROM dbo.Customer WHERE CustomerId = 3;

-- Now constraint can be enabled with CHECK option
ALTER TABLE dbo.Customer WITH CHECK CHECK CONSTRAINT Customer_SalaryCheck;
GO
-- And if you query sys.check_constraints now, you will see 0 in the is_not_trusted column.
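-- For example:
SELECT name, is_disabled, is_not_trusted
FROM sys.check_constraints
WHERE name = 'Customer_SalaryCheck';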

DROP TABLE dbo.Customer;
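If the goal is simply to switch off everything that can be disabled before a load, ALL can be used instead of a constraint name; here is a minimal sketch using the same dbo.Customer table, which disables all CHECK and FOREIGN KEY constraints on the table and later enables and validates them in one statement.

-- Disable all CHECK and FOREIGN KEY constraints on the table
ALTER TABLE dbo.Customer NOCHECK CONSTRAINT ALL;
GO

-- ... perform the bulk load here ...

-- Enable them again and validate existing rows so they become trusted
ALTER TABLE dbo.Customer WITH CHECK CHECK CONSTRAINT ALL;
GO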

Sunday, June 28, 2015

Should indexes be dropped before a bulk insert or not?

When loading a large volume of data into an indexed table, we usually drop the indexes, load the dataset, and re-create the indexes, assuming that the overhead of dropping and re-creating them is less than the overhead of loading data with the indexes in place. But in certain situations this assumption does not hold: when loading a small dataset, dropping and re-creating indexes can be counterproductive, taking more time than loading the data with the indexes in place.

Considering that, how do we determine whether we should drop the indexes before loading or insert all records with indexes in place?

Microsoft has given a set of guidelines for this. They are based on the minimum amount of new data to be loaded as a proportion of the total data in the table. If your new data load is equal to or greater than the percentage given, it is recommended to drop the indexes before loading and re-create them after loading.

Indexes - Relative amount of new data
  • Clustered index only - 30%
  • Clustered and one nonclustered index - 25%
  • Clustered and two nonclustered indexes - 25%
  • Single nonclustered index only - 100%
  • Two nonclustered indexes - 60%

For example, if you have 100,000 records in the table that has a clustered index only and have to load 30,000 new records, it is better to drop the index before loading and re-create it afterward.
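As a rough sketch of the drop-load-recreate pattern (the table, index, and file names are hypothetical):

-- Drop the nonclustered index before the load
DROP INDEX IX_SalesStaging_ProductCode ON dbo.SalesStaging;
GO

-- Load the new data (BULK INSERT is used here just as an example)
BULK INSERT dbo.SalesStaging
FROM 'D:\Data\new_sales.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
GO

-- Re-create the index after the load
CREATE NONCLUSTERED INDEX IX_SalesStaging_ProductCode
 ON dbo.SalesStaging (ProductCode);
GO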

You can read more info on this at: https://msdn.microsoft.com/en-us/library/ms177445.aspx

Thursday, March 19, 2015

I changed the Recovery Model before a BULK operation: is it a good practice?

I have already written a post on bulk operations, indicating the benefits we get by changing the recovery model before performing bulk operations. I got a couple of offline questions after my SQL Server Stored Procedure session at SS SLUG, and one of them was this. I was talking about how the plan cache gets flushed, and I mentioned that changing the recovery model causes the plan cache to be flushed. Now the question is, should we change the recovery model for our bulk operations?

First of all, let's see whether changing the Recovery Model causes cached plans to be removed. Here is the code I wrote for testing.

I am executing two SELECT statements, one against AdventureWorks2014 and the other against AdventureWorksDW2014. The last statement checks the plan cache.

-- EXECUTE STATEMENTS SEPARATELY

-- Cleaning the plan cache and buffers
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
GO
  
USE AdventureWorks2014;
GO

-- Query Sales.SalesOrderDetail table
SELECT * FROM Sales.SalesOrderDetail
WHERE SalesOrderDetailID = 1;
  

USE AdventureWorksDW2014;
GO

-- Query dbo.DimProduct
SELECT * FROM dbo.DimProduct
WHERE ProductKey = 1;

-- Checking cached plans
SELECT p.usecounts, p.cacheobjtype, p.objtype, t.text
FROM sys.dm_exec_cached_plans p
CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) t
WHERE 
 TEXT LIKE '%SELECT * FROM Sales.SalesOrderDetail%'
 OR TEXT LIKE '%SELECT * FROM dbo.DimProduct%';

As expected, we see two cached plans in the plan cache.



The following code changes the Recovery Model of AdventureWorks2014. The second statement shows the plans in the cache after the change.

-- Change the Recovery Model
ALTER DATABASE AdventureWorks2014 SET RECOVERY Bulk_Logged;
GO

-- And see plans
SELECT p.usecounts, p.cacheobjtype, p.objtype, t.text
FROM sys.dm_exec_cached_plans p
CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) t
WHERE 
 TEXT LIKE '%SELECT * FROM Sales.SalesOrderDetail%'
 OR TEXT LIKE '%SELECT * FROM dbo.DimProduct%';



As you can see, the plan related to AdventureWorks2014 has been removed. This proves that changing the Recovery Model flushes the plans related to that database from the cache.

Now the question is, how advisable it is to do this in a production environment for bulk operations. My recommendation is: if it is a large bulk operation and the database is heavily accessed by users, the bulk operation should be done during off-peak hours. We still lose the cached plans, but this at least minimizes the damage. However, if the database is continuously being used by users, then it is advisable to perform bulk operations without changing the Recovery Model.
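For completeness, here is a minimal sketch of the switch-load-switch-back pattern discussed above; the backup path is a placeholder, and taking a log backup after returning to FULL is my addition to keep the point-in-time recovery chain intact.

-- Switch to BULK_LOGGED before the bulk operation
-- (note: this flushes the database's cached plans)
ALTER DATABASE AdventureWorks2014 SET RECOVERY BULK_LOGGED;
GO

-- ... perform the bulk operation here (bcp, BULK INSERT, SELECT INTO, etc.) ...

-- Switch back to FULL once the operation is complete
ALTER DATABASE AdventureWorks2014 SET RECOVERY FULL;
GO

-- Back up the log so point-in-time recovery is possible again
BACKUP LOG AdventureWorks2014 TO DISK = 'D:\Backup\AdventureWorks2014_log.trn';
GO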