
News and updates from Microsoft Azure, pt. II

I recently joined Chris Presley for his podcast, Cloudscape, to talk about what’s happening in the world of cloud. I shared the most recent events surrounding Microsoft Azure. Topics of discussion included:
  • Azure DW - instant data movement
  • Azure Data Lake Storage Gen2 in preview
  • Azure IoT Edge is now generally available (GA)
  • Azure Container Instances generally available

Azure DW - instant data movement

This is a very technical improvement that the Microsoft team has made to Azure SQL Data Warehouse. Data Warehouse (DW) is an MPP system, a distributed database that spreads your data warehouse across multiple machines. Let’s say you have four machines; you can scale that up or down to two machines or more. When you run a query and the tables don’t match up on their distribution key, what happens under the covers is that data has to be moved back and forth between nodes. This happens with pretty much all distributed databases when there’s no data locality.

Before instant data movement, this operation, called a shuffle move, was very expensive to perform. If you had a fact table stored in columnar format, when a shuffle had to happen the data was taken out of the database engine process into a separate service called the data movement service. That service would decompress the columnar storage, turn it into rows, put it in the temporary database of the other node, and then finally join the information. That sounds convoluted, and it is convoluted. For a long time they have been working on how to improve this process, and that’s how they finally arrived at instant data movement.

The improvement is that there is no external service anymore; a separate data movement service is no longer needed to resolve a shuffle move. They basically merged the shuffle move operation into the engine, so the data doesn’t have to move out of the process. On top of that, if the data is in columnar form, there are optimizations so that it travels over the network in that same format, lands in memory on the other node, and then gets joined. It doesn’t go through the conversion to rows and it doesn’t hit the temporary space on the nodes. This is a much simpler process that leads to better response times. Based on some numbers I’ve seen, queries that do lots of shuffle moves see roughly a 50% improvement in general. If a query had a high cost because of shuffle moves, you can get up to a 50% performance improvement from instant data movement.

That’s interesting for the people who are techies, right? It’s cool how they did this optimization, but I think the bigger message here is that this is the nice thing about running on cloud-first software. If you’re a user of DW today, when they enabled instant data movement you didn’t have to do anything; your reports just got up to 50% faster. You didn’t have to take an outage to patch in order to get instant data movement. They just enabled it on their side by updating the binaries. You probably didn’t even realize they did it, and suddenly everything runs faster. It’s one of the huge differences in this new cloud model, where it’s just constant software updates and new features being released.
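
To make the shuffle-move idea concrete, here is a rough, purely conceptual Python sketch. This is not Azure SQL Data Warehouse code; the table layout, column names and node count are made up. It just shows why a join on a column other than the distribution key forces rows to move between nodes:

```python
# Conceptual sketch of an MPP shuffle move (not Azure code).
# Rows are hash-distributed across nodes by a distribution key; a join on a
# different column forces rows to be re-hashed and moved between nodes first.

NODES = 4

def distribute(rows, key):
    """Place each row on a node by hashing its distribution key."""
    placement = {n: [] for n in range(NODES)}
    for row in rows:
        placement[hash(row[key]) % NODES].append(row)
    return placement

# Hypothetical tables: the fact table is distributed on customer_id,
# the dimension table on region_id.
fact = [{"customer_id": i, "region_id": i % 10, "amount": i * 1.5} for i in range(1000)]
dim = [{"region_id": r, "region_name": f"region-{r}"} for r in range(10)]

fact_placement = distribute(fact, "customer_id")

# The join key is region_id, so fact rows are not co-located with their
# matching dimension rows: each misplaced row has to travel over the
# network (a shuffle move) before every node can join its local slice.
moved = sum(
    1
    for node, rows in fact_placement.items()
    for row in rows
    if hash(row["region_id"]) % NODES != node
)
print(f"{moved} of {len(fact)} fact rows would have to move between nodes")
```

Instant data movement doesn’t remove the need for the shuffle; it performs it inside the engine and keeps the columnar format on the wire, which is where the speed-up comes from.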

Azure Data Lake Storage Gen2 in preview

Azure Data Lake is an interesting service because Microsoft came out with it a couple of years ago and it was split into two components: the storage service, called Azure Data Lake Store, and the analytics service, called Azure Data Lake Analytics. The analytics service is a really cool service. You can just write some SQL and it gives you basically serverless job execution; you pay for the compute for your query and you get your results. But the service never gained a lot of traction because the Data Lake Store itself was never deployed widely. Azure now has about 50 regions around the world, and the service has always been kept to only six or eight of them.

It’s not widely deployed, and this affects adoption because nobody wants to adopt a service that’s not in their region. If I’m in Ireland, for example, I don’t want to adopt a service that is only available in the Eastern US, right? Now, I think somebody at Microsoft sat down and said, “Guys, we have built two storage services. We already have Blob storage, which is widely used, our clients know it very well, and we are continuing to develop features for it. But at the same time, we’re kind of developing another storage service.” It didn’t really make a lot of sense. When they came out with the Gen2 announcement, it suddenly clicked in my mind why they never widely deployed Gen1: they have been working on Gen2 all this time. They are taking the unique features of Data Lake Store and putting them on top of regular Blob storage, which is probably what they should have done two or three years ago.

There are a couple of interesting differences between Data Lake Store and regular Blob storage. For example, Blob storage does not have the ACL-style security we’re used to from file systems, where you right-click in Windows and see the list of identities in your directory with read/write/execute permissions. That’s not how Blob storage security works. It’s usually done through a shared access signature, where you hold a key and share a temporary key so the holder can access the storage; it’s not tied to an identity. With Data Lake, they want to make it more like working with a regular file system, so it’s going to have that type of ACL security.

It’s also going to have folders and a true hierarchical structure, which does not exist today directly in Blob storage. This is an interesting and important difference that a lot of people don’t consider. Even though we can create containers and put slashes in blob names to make it look like we have folders, the namespace is flat at the end of the day; it is not truly hierarchical. If you move a thousand files from one folder to another, it’s actually a thousand operations under the covers, as opposed to a single metadata operation. The other change they’re making is that in Data Lake Store, folder operations such as moving and renaming are first-class operations that will be very efficient. They’re building all of that on top of Blob storage, and by doing so they also get all the advantages Blob storage already has, such as the cool and archive tiers, lifecycle management and so on.

This is very smart in my opinion. It’s what should have happened two or three years ago. They should never have gone with an independent service built from scratch, but it looks like they’re changing course now and building this Data Lake store on top of Blob storage. Eventually we’re going to get the best of both worlds, and I’m sure that once that’s done, Gen2 is going to be widely deployed around the world.
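
As a quick illustration, here is a minimal sketch of the kind of hierarchical-namespace operations described above, assuming the azure-storage-file-datalake Python package (which postdates the preview discussed here); the account, filesystem and path names are placeholders:

```python
# Minimal sketch, assuming the azure-storage-file-datalake package and a
# storage account with the hierarchical namespace (Gen2) enabled.
# Account, key, filesystem and paths below are placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<account>.dfs.core.windows.net",
    credential="<account-key>",
)

fs = service.get_file_system_client("lake")
directory = fs.create_directory("raw/2018/telemetry")

# With a hierarchical namespace, this rename is a single metadata operation,
# no matter how many files live under the directory. On flat Blob storage,
# the same "move" would mean copying and deleting every blob individually.
directory = directory.rename_directory(new_name="lake/telemetry-archive-2018")

# POSIX-style ACLs on the directory, tied to identities rather than to a
# shared access signature.
directory.set_access_control(acl="user::rwx,group::r-x,other::---")
```

The same operations against a flat Blob container would have to enumerate and touch each blob, which is exactly the overhead Gen2 is meant to remove.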

Azure IoT Edge is now generally available (GA)

What we are seeing is cloud providers coming up with more technology to support disconnected or hybrid scenarios, where you want to accumulate a lot of data that you eventually move to the cloud, and you might be in a rugged environment and so on.

In the case of IoT Edge, it’s basically a runtime that you can host on a server or inside devices, depending on how much compute power you want to have available. It gives you a small subset of the Azure IoT services in a disconnected fashion. It can even be used in a connected fashion, where you aggregate data before you move it to the cloud. For example, if you have a factory with a thousand sensors, it does not make a lot of sense for a thousand sensors to communicate directly with the cloud, right? You could run IoT Edge instead. Let’s say you put one server in your factory running the IoT Edge runtime. That runtime can host a Stream Analytics engine, which would usually run in the cloud but in this case runs on that server. The Stream Analytics engine aggregates the information from the thousand factory sensors and only sends the aggregated data up to Azure. That makes a lot of sense in IoT scenarios where not every single event or data point has to make it all the way to the cloud, and it reduces the volume of data you send without sacrificing much accuracy if your data generation frequency is very high.

For places such as drilling platforms, where you might not have 100% connectivity, you could have a server running the IoT Edge runtime. Everything continues to update and aggregate using the same APIs you would have in Azure, and when the IoT Edge server regains connectivity, it can start uploading that data for permanent storage in the cloud. I think IoT in particular is a good use case for these local runtimes, where not absolutely everything has to go to the cloud.
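
As a rough illustration of the edge-aggregation idea, here is a plain-Python sketch (not IoT Edge module code; the sensor count and window size are made up) showing how summarizing a window of readings locally means one aggregate message goes upstream instead of tens of thousands of raw ones:

```python
# Conceptual sketch of edge-side aggregation (not the IoT Edge runtime itself).
# A thousand sensors emit readings every second; instead of sending every
# reading to the cloud, the edge box summarizes each one-minute window and
# sends only the aggregate upstream.
import random
from statistics import mean

SENSORS = 1000
WINDOW_SECONDS = 60

def read_sensors():
    """Stand-in for one second's worth of raw readings from all sensors."""
    return [random.gauss(21.0, 0.5) for _ in range(SENSORS)]

window = []
for _ in range(WINDOW_SECONDS):
    window.extend(read_sensors())

# One aggregate record replaces 60,000 raw data points for this window.
aggregate = {
    "count": len(window),
    "min": min(window),
    "max": max(window),
    "avg": mean(window),
}

# In a real deployment, this is where the aggregate would be handed to the
# local IoT Edge hub for upload to Azure (or queued while disconnected).
print(aggregate)
```

On an actual IoT Edge box, this windowing would typically be expressed as a Stream Analytics job running in the local runtime, with the output routed up to Azure once connectivity is available.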

Azure Container Instances generally available

The Azure Container Instances service has gone GA. I was talking with some clients a few days ago about the different ways to run containers in Azure, and because there is both an Azure Kubernetes Service and an Azure Container Instances service, there is a little bit of redundancy and a little bit of confusion there. The idea is that if you need a full container orchestration solution with automatic scaling, or you need to coordinate between containers, then you would use the Azure Kubernetes Service. But Microsoft also figured out that some people didn’t want to deal with the added complexity of orchestration, or didn’t really need it at all. All they wanted was a simple service where they can drop their container somewhere, or point it at an online repository for the container image, execute it serverless and pay only by the second.

For example, how many clients do we have with VMs that only get booted up for one single function? They perform one thing - whether it’s a job, an ETL, some automation or some sort of custom application - and once that VM has done its work, it just shuts down. Obviously you can replace that with a container, and maybe you don’t need the complexity of full Kubernetes. That is the gap this service covers: you just want to run an isolated container with a simple application. You can do Linux or Windows, obviously; nowadays pretty much every compute service is both Linux and Windows friendly.

This was a summary of the Microsoft Azure topics we discussed during the podcast. Chris also welcomed Greg Baker (Amazon Web Services), who discussed topics related to his expertise. Listen to the full conversation and be sure to subscribe to the podcast to be notified when a new episode is released.
