Pythian Blog: Technical Track

Working with XML Data in BigQuery

In BigQuery, loading and unloading of XML Data is not supported. The only way to load/unload XML Data into/from BigQuery is to use external tools like Python or other third-party tools.

In my use case, we were required to unload/export customer data from BigQuery into an XML file. We created a Python program to read the BigQuery data and load it into an XML formatted file.

Briefly, let's understand what XML is and why it is still around.

XML (eXtensible Markup Language) is a versatile markup language for storing and transporting data. It's commonly used to structure, store, and share data across different systems in a readable format. XML uses tags to define elements within a document, creating a hierarchical structure that's both human-readable and machine-readable.

XML remains relevant and is still used for several reasons despite the emergence of other data formats like JSON, YAML, and others:

  • Structured Data Representation: XML provides a way to structure and organize data hierarchically using tags, making it easy for both humans and machines to read and understand.
  • Industry Standards and Legacy Systems: Many industries and systems have been built around XML. Legacy systems, industry standards, and specific protocols (like SOAP for web services) still heavily rely on XML.
  • Document Structure and Metadata: XML's flexibility allows for the definition of complex document structures and the inclusion of metadata, making it suitable for document-based data representation.
  • Data Interchange and Integration: XML facilitates data exchange between different systems, platforms, and programming languages. It's used in various scenarios such as web services, configuration files, data interchange between applications, etc.
  • Support for Validation: XML Schema (XSD) enables the definition of strict rules for the structure and content of XML documents, ensuring data integrity and validation.
  • Longevity and Familiarity: XML has been around for a long time and is well-understood by many developers and organizations. Existing knowledge and tooling make it easier to work with XML in certain contexts.
  • Extensibility: As the 'X' in XML suggests, it's extensible. This means you can define your tags and structure, making it adaptable to different use cases and evolving requirements.

However, it's essential to note that XML's usage has somewhat diminished in certain areas due to the emergence of more lightweight and easy-to-parse formats like JSON, especially in web-based applications and APIs. JSON is often preferred for its simplicity, compactness, and ease of use in JavaScript environments.

The choice between XML and other data formats often depends on factors like existing infrastructure, industry standards, data complexity, interoperability needs, and specific use case requirements. For newer applications, JSON or other formats might be preferred, while for certain legacy systems or industries, XML remains a prevalent choice due to established practices and standards.

 

Customer Use Case

Unload customer data from BigQuery into Structured XML format and push that XML file into an SFTP location.

We followed two steps to accomplish this. 

  • Create a Python script that reads the BigQuery data and structures the XML file before sending it to the SFTP location.
  • Use Google Cloud Composer, also known as Airflow, to orchestrate and push the file to an SFTP location.

 

Python Code

To unload data from BigQuery into XML using Python, you can use the BigQuery Python client library to fetch the data and an XML library, such as `xml.etree.ElementTree`, to create the XML file. Here's a step-by-step guide:

  1. Install the required libraries:
`pip install google-cloud-storage`

`pip install google-cloud-bigquery`
  1. Set up your Google Cloud credentials. You can either set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of your service account key file or use `google.cloud` library to authenticate.
  2. Here is the link to the code to unload data into XML:

DAG Code link: https://github.com/pythian/blog-files/tree/offload_xml_from_BQ

Make sure to replace the placeholder values in the script with your actual project, dataset, and table information. This script fetches data from BigQuery, creates an XML structure, and saves it to the specified XML file.

 

Airlfow DAG

 

References:

https://realpython.com/python-xml-parser/

https://docs.python.org/3/library/xml.etree.elementtree.html

No Comments Yet

Let us know what you think

Subscribe by email