Atlassian has long been an advocate of data warehouse-style architecture, according to the company’s data platform senior manager, Rohan Dhupelia.

At one point the company was running two data warehouses. One was a PostgreSQL data warehouse that powered business intelligence and the company’s dashboards, and was typically used by finance, support, and marketing.

The second was an Amazon Redshift data warehouse for research and development. 

“It was here that we shipped all of our clickstream data from our products, and used notebooks and SQL analytics to understand the user journey and patterns through our products,” Dhupelia explained during the keynote of the virtual Data+AI Summit 2021.

But having two data warehouses did Atlassian no favours, as it ended up causing the company more problems.

“Primarily, we noticed that a large number of datasets were typically being copied across from one data warehouse to another. These copies were brittle and often added delays to downstream pipelines and analysis,” Dhupelia said.
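Dhupelia did not detail the pipelines, but a cross-warehouse copy of this kind typically looks something like the following sketch, here assumed to move a table from PostgreSQL to Redshift via S3. Every identifier in it (hosts, bucket, table, IAM role) is hypothetical; the point is how many moving parts each copy adds, any of which can break when a schema drifts on either side.

```python
"""Hypothetical sketch of a PostgreSQL-to-Redshift table copy staged
through S3. All hosts, names, and credentials are placeholders."""

import io

import boto3
import psycopg2

# 1. Export the table from the PostgreSQL warehouse as CSV.
pg = psycopg2.connect(host="pg-warehouse.example.com", dbname="bi")
buf = io.BytesIO()
with pg.cursor() as cur:
    cur.copy_expert("COPY signups TO STDOUT WITH CSV", buf)

# 2. Stage the export in S3 so Redshift can read it.
buf.seek(0)
boto3.client("s3").upload_fileobj(buf, "example-staging", "signups.csv")

# 3. Load it into the Redshift warehouse. If the table definitions have
#    drifted apart, this step fails and downstream pipelines stall.
rs = psycopg2.connect(host="redshift.example.com", dbname="rnd", port=5439)
with rs.cursor() as cur:
    cur.execute("""
        COPY signups
        FROM 's3://example-staging/signups.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
        CSV;
    """)
rs.commit()
```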

Other issues included the two data warehouses’ differing SQL syntaxes, which made it difficult to convert queries between them, and the growing cost of pooling data from the two warehouses together.
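As an illustration of that dialect drift (using a hypothetical table, not Atlassian’s schema or queries), the same seven-day aggregation needs rewriting between PostgreSQL and Redshift:

```python
# Illustrative only: the same query expressed in each warehouse's dialect.
# Table and column names are hypothetical.

POSTGRES_QUERY = """
SELECT account_id,
       string_agg(product, ', ')            -- PostgreSQL aggregate
FROM   signups
WHERE  created_at > now() - interval '7 days'
GROUP  BY account_id;
"""

REDSHIFT_QUERY = """
SELECT account_id,
       listagg(product, ', ')               -- Redshift equivalent
FROM   signups
WHERE  created_at > getdate() - interval '7 days'
GROUP  BY account_id;
"""
```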

“As a result, a lot of analysis just didn’t happen because the engineering tax was just way too high,” Dhupelia said.

At that point the company re-evaluated its architecture and opted to trade its two data warehouses for a single S3-based data lake. The switch delivered positive outcomes, including a lower “engineering tax” and effectively unlimited scale, but the data lake’s performance was not up to scratch.

“We could manage to get relatively good concurrency with Presto, but smaller queries were still not returning as fast as they did in the data warehouse architecture. Also, modelling data for dashboards and BI use cases was quite difficult,” Dhupelia explained.

It also meant a high barrier to entry for data analytics and science use cases.

“Our data platform team was becoming the bottleneck for users wanting to do anything advanced on the platform,” Dhupelia said. “Often users had to ask us to create a cluster for them or add particular libraries to their cluster.”
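The Databricks REST API gives a flavour of what each of those requests involved. The endpoints below (clusters/create and libraries/install) are the documented 2.0 API, but the workspace URL, token, and cluster settings are placeholders, not Atlassian’s configuration:

```python
"""Sketch of a platform-team chore: creating a cluster for a user and
attaching a requested library via the Databricks REST API 2.0.
All values are placeholders."""

import requests

HOST = "https://example.cloud.databricks.com"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}

# Create a small cluster for an analyst.
resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers=HEADERS,
    json={
        "cluster_name": "analyst-adhoc",
        "spark_version": "8.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
)
cluster_id = resp.json()["cluster_id"]

# Attach the extra library the user asked for.
requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers=HEADERS,
    json={
        "cluster_id": cluster_id,
        "libraries": [{"pypi": {"package": "scikit-learn"}}],
    },
)
```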

For Dhupelia, the solution was implementing Databricks into the environment, which he said has moved the company closer to achieving a “nirvana state”.

“We are now able to perform queries much faster, thanks in part to Databricks’ optimised runtimes, but also as a result of the optimisations that came with converting tables to the Delta Lake format. This meant an improved experience for business intelligence style use cases,” he said.
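Assuming the conversion Dhupelia describes targets Delta Lake’s table format, a minimal sketch of converting an existing Parquet table in place looks like this (the table location and partition schema are hypothetical):

```python
"""Minimal sketch of an in-place Parquet-to-Delta conversion using the
delta-spark package. Path and partition schema are hypothetical."""

from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("convert-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Convert the existing Parquet table in place: Delta adds a transaction
# log alongside the data files, enabling the optimisations quoted above.
DeltaTable.convertToDelta(
    spark,
    "parquet.`s3://example-lake/events`",  # hypothetical table location
    "event_date DATE",                     # partition column(s), if any
)
```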

In the coming months, Atlassian plans to move more business intelligence workloads into Databricks, following recent trials of Databricks SQL.
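One way to run a BI-style query against a Databricks SQL endpoint from Python is the databricks-sql-connector package; the hostname, HTTP path, token, and query below are placeholders rather than anything Atlassian has described:

```python
"""Sketch of querying a Databricks SQL endpoint with
databricks-sql-connector. All connection details are placeholders."""

from databricks import sql

with sql.connect(
    server_hostname="example.cloud.databricks.com",
    http_path="/sql/1.0/endpoints/<endpoint-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cursor:
        cursor.execute(
            "SELECT product, count(*) FROM signups GROUP BY product"
        )
        for row in cursor.fetchall():
            print(row)
```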

“We’re also planning on moving more tables towards Delta Lake to further improve that performance, but also to simplify workloads that need strong dimensional modelling,” Dhupelia added.

“We’re looking at ways that we can enable more sensitive use cases by using Immuta, which is a self-service data access and privacy control layer on top of that data lake.

“At Atlassian, we have proven that there is no longer a need for two separate data warehouses. Technology has advanced far enough for us to consider one single unified lakehouse architecture.”
