
Book Pipe

Brandon Yau, Intern – 11/2021


Azure Data Processing Pipeline


At Sophron.io we were given the task of creating a pipeline that takes in data objects modeled after books, filters and transforms them, and sends them to a cloud-based database. At first glance this appears to be an easy task: take in JSON data, transform it, and send it on in a synchronous manner. Upon further analysis, however, this approach has pitfalls. Parsing JSON and filtering or transforming it is typically quick and does not require any callback, so a synchronous approach to that part is reasonable. Sending data to a database, on the other hand, is a taxing operation: every send must wait for a callback from the database before it completes, which prevents concurrent sends from the same program. With this in mind, we designed a pipeline that has the simplicity of a synchronous parsing program but the efficiency of an asynchronous data sender. Five main tools and resources were used in making this pipeline: Python 3.8, Azure Logic App, Azure Functions, Azure Queues, and CosmosDB.


Figure 1: Azure Logic App

Figure 1 above shows a high-level view of how book data is processed and transformed within our Azure Logic App. The diagram begins with an oval representing our book data, which is passed to an Azure Function, then processed by Azure Queue Functions, and finally finishes its journey in Azure’s CosmosDB.


We use Azure Functions to handle the initial processing and parsing of the JSON book data. This function sends each book to one of two Azure Queues, the PASS queue or the FAIL queue, depending on whether the book meets the required parameters. Once a book has its designated queue, it is shipped off to that queue using Azure’s standard Python library. To mitigate the time spent waiting for each call to return, we handle the shipping of books asynchronously. In short, while one book is still “sending” to the queue, we can process the next book and “send” it as well, since the books are independent of each other.
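The routing and asynchronous sending described above can be sketched roughly as follows. This is a minimal illustration, not our production function: the filter rule (a book needs a non-empty title and author), the queue names, and the `send_to_queue` helper are all assumptions made for the example, and the helper only simulates network latency instead of calling a real Azure Queue client.

```python
import asyncio
import json

# Hypothetical queue names, chosen for this example only.
PASS_QUEUE = "pass"
FAIL_QUEUE = "fail"


def choose_queue(book):
    """Hypothetical filter: a book passes if it has a non-empty title and author."""
    if book.get("title") and book.get("author"):
        return PASS_QUEUE
    return FAIL_QUEUE


async def send_to_queue(queue_name, book, sent):
    """Stand-in for a real queue client call; the sleep simulates network latency."""
    await asyncio.sleep(0.01)
    sent.append((queue_name, json.dumps(book)))


async def process_books(raw_json):
    # Parsing and filtering are fast, so they stay synchronous.
    books = json.loads(raw_json)
    sent = []
    # Start every send without waiting for the previous one to finish.
    await asyncio.gather(
        *(send_to_queue(choose_queue(book), book, sent) for book in books)
    )
    return sent


raw = json.dumps([
    {"title": "Dune", "author": "Frank Herbert"},
    {"title": "", "author": "Unknown"},
])
results = asyncio.run(process_books(raw))
for queue_name, body in results:
    print(queue_name, body)
```

Because the sends overlap, the total time is close to the latency of the slowest send, not the sum of all of them.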


Once our books reach the Azure Queue, we tag each one with a proper UUID and then send the now transformed and cleaned book into our CosmosDB. This step may seem counter-intuitive, since it adds a seemingly unnecessary process to our logic app. However, due to the nature of database calls, this function is essential to our logic app. A call to a database is much more taxing and time-consuming than sending to a queue. An asynchronous approach could be taken to mitigate this, but when shipping items to a database, callbacks must still be awaited, which stalls the entire process. With an Azure Queue, we use single-instance function calls to scale out the number of books we can process. By approaching the problem in this manner, we are able to turn our synchronous process into an asynchronous one.
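As a rough sketch of the queue-side step, the handler below takes one queue message, tags the book with a UUID, and stores it. The function name `handle_queue_message` is hypothetical, and a plain dictionary stands in for the CosmosDB container so the example runs on its own. Storing the UUID under the `id` key is also an assumption about our item layout.

```python
import json
import uuid


def handle_queue_message(message_body, container):
    """Process one queue message: tag the book with a UUID and store it.

    `container` is a dict standing in for a database container (assumption).
    """
    book = json.loads(message_body)
    book["id"] = str(uuid.uuid4())  # unique identifier for the stored item
    container[book["id"]] = book    # stand-in for a database upsert
    return book["id"]


fake_container = {}
book_id = handle_queue_message('{"title": "Dune", "author": "Frank Herbert"}', fake_container)
print(fake_container[book_id]["title"])
```

Each message is handled independently, so many such calls can run side by side while each one individually remains a simple synchronous write.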


In our first attempt we handled everything synchronously with just an Azure Function, and it took 3 hours to process 11,000 books. In our final pipeline, we used asynchronous processing, along with the ability to handle a synchronous process asynchronously (Azure Queues), to reduce this time from 3 hours to 8 minutes.
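To put those numbers in perspective, the short calculation below derives the speedup and the approximate throughput from the figures quoted above. It assumes the 3 hours and 8 minutes are exact wall-clock times, which they are only approximately.

```python
BOOKS = 11_000
before_seconds = 3 * 60 * 60  # first attempt: 3 hours
after_seconds = 8 * 60        # final pipeline: 8 minutes

speedup = before_seconds / after_seconds
rate_before = BOOKS / before_seconds  # books per second, first attempt
rate_after = BOOKS / after_seconds    # books per second, final pipeline

print(f"speedup: {speedup:.1f}x")                  # speedup: 22.5x
print(f"before: {rate_before:.2f} books/second")   # before: 1.02 books/second
print(f"after: {rate_after:.2f} books/second")     # after: 22.92 books/second
```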

 
 
 
