Temporal Tables

I’ve recently been introduced Temporal Tables which is a new feature of SQL Server 2016. After studying this new feature, I have to say this is probably one of the features I’m the most excited about. I’m hoping to use this as a replacement to current auditing and versioning techniques that require triggers or extra columns.

Here is a simple script that will create a table.

CREATE TABLE dbo.Employee   
(    
  EmployeeID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED   
  , Name nvarchar(100) NOT NULL  
  , Position varchar(100) NOT NULL   
  , HealthPlan varchar(25) NOT NULL
  , UpdatedBy varchar(25) NOT NULL
		    CONSTRAINT DF_Employee_UpdatedBy DEFAULT(SUSER_NAME())
  , ValidFrom datetime2(2) GENERATED ALWAYS AS ROW START HIDDEN
  , ValidTo datetime2 (2) GENERATED ALWAYS AS ROW END HIDDEN
  , PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)  
 )    
 WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));  

 

Since the validFrom and validTo days are flagged as Hidden, they do appear when we query using Select * from the table.  They will return however, if we specify them specifically.

The script creates 2 tables. Also, notice how the Employee table has a different icon than a regular table.

ssms_table

You can’t truncate a history table or run any updates or deletes against the history table while the source table still has versioning enabled. To make these kind of changes, first alter the table, make the changes, and then enabling versioning again.

 /* Sets versioning off */
ALTER TABLE dbo.Employee SET ( SYSTEM_VERSIONING = OFF) 
/* Make changes to the EmployeeHistory table */

 /* Sets versioning back on */
ALTER TABLE dbo.Employee SET ( SYSTEM_VERSIONING = ON) 

 

Another nice feature is that whenever you add a column to the source table, this column automatically gets added to the history table. This was a pain point previously where any new columns also potentially had to get added to additional audit tables and triggers manually. This left a lot of room for forgetting and auditing against the new column being missed.

Elastic{ON}16 – Takeaways – Part 3

Probably one of the biggest takeaways I gained at Elastic{ON} would have to be what I learned in regards to the integration with Hadoop, Hive and Spark.

Currently I do some work with integrating Hive to ElasticSearch.  I had noticed that it seemed like Hive would retrieve all the records first before filtering the result set based on the query. I also noticed that queries I would consider simple, such as just running a count, engaged the full stack.

0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM testElasticSearch;
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:5
INFO : Submitting tokens for job: job_local503289697_0029
INFO : The url to track the job: http://localhost:8080/
INFO : Job running in-process (local Hadoop)
INFO : 2016-03-29 23:52:44,846 Stage-1 map = 0%, reduce = 0%
INFO : 2016-03-29 23:52:50,854 Stage-1 map = 5%, reduce = 0%
INFO : 2016-03-29 23:52:53,859 Stage-1 map = 7%, reduce = 0%
INFO : 2016-03-29 23:52:56,871 Stage-1 map = 9%, reduce = 0%
INFO : 2016-03-29 23:52:58,875 Stage-1 map = 12%, reduce = 0%
INFO : 2016-03-29 23:53:01,882 Stage-1 map = 100%, reduce = 0%
INFO : 2016-03-29 23:53:07,896 Stage-1 map = 25%, reduce = 0%
INFO : 2016-03-29 23:53:10,902 Stage-1 map = 28%, reduce = 0%
INFO : 2016-03-29 23:53:13,909 Stage-1 map = 30%, reduce = 0%
INFO : 2016-03-29 23:53:17,209 Stage-1 map = 100%, reduce = 0%
INFO : 2016-03-29 23:53:23,221 Stage-1 map = 45%, reduce = 0%
INFO : 2016-03-29 23:53:26,229 Stage-1 map = 48%, reduce = 0%
INFO : 2016-03-29 23:53:29,236 Stage-1 map = 50%, reduce = 0%
INFO : 2016-03-29 23:53:32,243 Stage-1 map = 100%, reduce = 0%
INFO : 2016-03-29 23:53:38,256 Stage-1 map = 65%, reduce = 0%
INFO : 2016-03-29 23:53:41,264 Stage-1 map = 68%, reduce = 0%
INFO : 2016-03-29 23:53:44,271 Stage-1 map = 71%, reduce = 0%
INFO : 2016-03-29 23:53:47,278 Stage-1 map = 73%, reduce = 0%
INFO : 2016-03-29 23:53:48,281 Stage-1 map = 100%, reduce = 0%
INFO : 2016-03-29 23:53:54,298 Stage-1 map = 85%, reduce = 0%
INFO : 2016-03-29 23:53:57,311 Stage-1 map = 88%, reduce = 0%
INFO : 2016-03-29 23:54:00,320 Stage-1 map = 91%, reduce = 0%
INFO : 2016-03-29 23:54:03,328 Stage-1 map = 100%, reduce = 100%
INFO : Ended Job = job_local503289697_0029
+---------+--+
| _c0 |
+---------+--+
| 371070 |
+---------+--+
1 row selected (80.921 seconds)

 

What happened to _count? I assumed that the integration would be able to examine the query before Hive passed it on, allowing it to manipulate the query text to be more efficient. At the conference, I learned that due to Hive, this simply isn’t possible yet. Costin referred to this operation as “Push-Down”.   Currently Spark is the only on that allows for this Push-Down operation.Pushdown

Costin addressed another challenge I’ve been facing which what I like to refer to as the field identity crisis. In the mapping, it does not explicitly call out fields that are arrays. This can cause the queries to fail when that field goes from a string to an array of strings.

Take this sample mapping:

{
   "testdata": {
      "mappings": {
         "stringarray": {
            "properties": {
               "userid": {
                  "properties": {
                     "id": {
                        "type": "long"
                     }
                  }
               }
            }
         }
      }
   }
}

 

In Hive, we would define the table as something like this:

CREATE EXTERNAL TABLE testdata (userid STRUCT<id:bigint>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'testdata/stringarray',
'es.nodes' = 'xxxx',
'es.port' = '9200');

 

Now when we query the table, notice the error returned:

SELECT * FROM testdata;

Error: java.io.IOException: java.lang.ClassCastException (state=,code=0)

 

We receive this vague error due to the id field sometimes actually being an array of structs.

 "userid": [
                  {
                     "id": 1
                  },
                  {
                     "id": 2
                  }
           ]

 

This specific document would need to be handled with the below table definition.

CREATE EXTERNAL TABLE testdata (userid ARRAY&<STRUCT<id:bigint>>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'testdata/stringarray',
'es.nodes' = 'xxxx',
'es.port' = '9200');

 

This is a challenge for Hive. We need the data to stay consistent to its type or we will not be able to successfully retrieve all the records from the index with Hive.

Elastic{ON}16 – Takeaways – Part 2

I attended the “Stories from Support” session at Elastic{ON}. There were a few new troubleshooting tips I learned as well as some recommendations that I was not aware of.

I recently had been looking for a rule of thumb for shard sizes.  I’m glad this was clarified during the session. The only information I was able to find for it online showed about 30GB to be the upper limit for a shard size. During the session, they stated this to be actually at 50GB.  I would also like to emphasize that this is a rule of thumb. Actual performance results will vary depending on a number of factors.

I also learned that a higher number of shards could actually negatively impact your  search times.  You also don’t get any benefit from having more shards than nodes if your index size is small. The old default of 5 shards per index is actually not great, especially if none of your indexes are not very large. It also isn’t that great if your indexes are time based and you are creating indexes daily. I’ve run into this myself with Logstash. If you leave the defaults in place, you will quickly have a high number of shards.

You can view your shards with the following command:

GET _cat/shards?v

Setting the number of shards can only be done at index creation time. Here is the command to set the number of shards:

PUT [newIndexName]
{
"number_of_shards" : xx
}

Elastic{ON}16 – Takeaways – Part 1

This was my first time at an ElasticSearch conference.  I had a great time, learned a few new things and even stumbled upon some new software I’m looking forward to trying out.

I’ve been to a few conferences before and typically my experience with the outside vendors hasn’t been anything too exciting.  The software would either cost more than we would be willing to pay or we already developed something internally that solved our need. I was pleasantly surprised when this was not the case at Elastic{ON}.   Two of the products I found the most interest in were Kibi and StreamSets.  Kibi a platform built on top of Kibana. For me, Kibana has had a few gaps. You cannot join two different indexes and you cannot reference external data sets.   With Kibi, you are able to do joins and even reference external data sets.  Since Kibi is built on top of Kibana, the interface is familiar and not too difficult to pick up if you have previous experience with Kibana. I’ve started looking at Kibi, but looks like to fully take advantage of everything, we need to update to ElasticSearch 2.0 which we are not quite ready to do yet.

The other piece of software that really sparked my interest was StreamSets. One of the difficulties in making a jump to some of these technologies has been the lack of any type of user interface.   Moving data to HDFS from SQL with Sqoop was not something the average user wanted to do.  StreamSets interface is visually appealing and a much easier tool to use for any user, especially those who will be newer to the big data world.

I did get a chance to setup and move some data with StreamSets. Setting up a SQL datasource, I found, was a little tricky at first. The drivers for MSSQL are not bundled with StreamSets. You have to add these yourself.  Also, when creating the connection, the password is stored in clear text.

streamSetsclear

 

Installing ElasticSearch in AWS

This post will walk through installing ElasticSearch in AWS using minimal setup and defaults.

Under the EC2 dashboard, click on Launch Instance.
– Choose Amazon Linux for the AMI.
– Choose the EC2 instance type you need.
freetier
*The t2.micro may be free tier eligible.

This next step is optional and the code for the user data can be executed manually later if you wish.

Under Advanced Details use the script below for the user data. This will install ElasticSearch 2.0.

#!/bin/bash
#Script installs ES
wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.0.0/elasticsearch-2.0.0.rpm
yum install elasticsearch-2.0.0.rpm -y

Now click Review and Launch.

Once the system is up you an connect and start the ElasticSearch service.

# service elasticsearch start

Check the status:

# curl -i localhost:9200

We want to see something similar to this:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 324

{
  "name" : "Hammer Harrison",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.0.0",
    "build_hash" : "de54438d6af8f9340d50c5c786151783ce7d6be5",
    "build_timestamp" : "2015-10-22T08:09:48Z",
    "build_snapshot" : false,
    "lucene_version" : "5.2.1"
  },
  "tagline" : "You Know, for Search"
}

And that completes a simple install of ElasticSearch in AWS!