Redshift Limitations

Brief:

We had to implement redshift for one of our projects. But the limitations that we faced were so many with redshift so we moved on to use some other data store to implement our solution for reporting of data.

Need:

Since the project required us to process huge amount of data into a structure that is required forreporting. The data consisted of user activity on the website around the world. The website also hosted some live video chats and conferences online and also recorded these activities into the database. The daily activities on an average recorded each day where 50K+ rows in more than 10 tables. This website also recorded the chats between the users and their representatives.

Limitation that we faced:

As we were looking for the data ware house for the project we came across Amazon redshift.
Redshift is built over postgres and provides scalability for petabytes of data. It stores data in
columnar format. It uses parallel querying the nodes. It also provides jdbc and odbc drives so that we can use it with any sql client.

The major limitation that we faced was the limited data types available for Redshift. The varchar datatype stored a limited length of data. As the data stored chats too the data storage range required was longer than provided. If the redshift encountered any such data it would revert all the data was inserted. This resulted in loss of time and data that was processed. Since the client also wanted the chat data to be reflected into the reports it was mandatory add the chats as it is to redshift without any modifications.

We also faced limitations since the data contained special characters. Redshift does not accept special characters and reverted the whole insert if it encountered one, this also resulted in loss of data and time.

The other limitation that we faced with redshift was techniques of insertion of data to redshift . It provided intergration with amazon s3, dynamo db, amazon mapreduce. Our original data was stored in mysql on our server. So, we had to either process our data store it to S3 or dynamo db and than store it to the redshift database. This was very time consuming. So we tried using data pipelines. The issue with data pipelines was that it only copied schema to schema data. Also, since our mysql db lied inside closed network, there was no way that the data pipelines could connect to our db using data pipelines. Since data pipelines use a new instance every time it runs.

So, we went ahead to use a java library by using the java helpers that amazon provided. We hosted this java library to our server. This code worked efficiently for few records, but we had sanitized the data to remove special characters and limiting the length of data to the redshift standards. But this data lost its original form and this affected the reports presentation in the end.

This were actually the limitations that we faced because the data we had and the requirement for reports. Redshift worked well with aggregated data and also the output was faster.

This article is not to discourage the use of Redshift since many organizations are already using it. It

was just a wrong choice of data store that we used for the data we had.

Learning from above:

1. Study your data that is required and think about the output that is required first.

2. Look into data warehousing techniques that actually fit your data and requirements, this is an important decision and things might go way wrong if you fail at this step.

3. Model your data appropriately to according to the reports.

4. Pre-process your data before you store it into your data warehouse.

Podcast

Michael Patterson sat down with the CEO of Boston Byte, Mustapha Shaikh to discuss the significance and rapid digitization of the healthcar...