
Seamless SQL Code Conversion: Transforming MS SQL to PostgreSQL with LangChain

In today’s data-driven world, database migration is a common challenge that many organizations face. Migrating from MS SQL to PostgreSQL can enhance performance, reduce costs, and improve scalability. In this post, we'll explore how to leverage LangChain and prompts to facilitate this conversion process efficiently.

Here's a quick overview of the technologies involved:

  • Python (3.11): The programming language used to build the pipeline, wire LangChain agents to prompt templates, and drive the code conversion during migration.
  • LangChain with OpenAI: A framework that integrates OpenAI's language models to create intelligent agents, enabling efficient query transformation and automated code conversion throughout the migration.
  • Docker: A containerization platform used to dynamically spin up PostgreSQL containers, allowing converted queries to be tested and validated for syntax and other issues before migration.
  • MS SQL: The relational database management system that acts as the source of the migration; its schema and code are converted to facilitate the transition to PostgreSQL.
  • PostgreSQL: A powerful open-source relational database management system that serves as the target of the migration, where the converted schema and code from MS SQL are implemented.

If you're interested in exploring LangChain further, particularly in terms of prompt engineering, you can find comprehensive documentation and examples on their official site. Check out the LangChain Prompts Documentation for more information.

The solution can be implemented as a web application or a REST API using FastAPI, allowing for seamless integration and easy access to the SQL conversion functionality. By wrapping the solution in this way, users can interact with it through a user-friendly interface or programmatically via API calls, enhancing usability and flexibility for various applications.
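As a quick illustration, here's a minimal sketch of what such a FastAPI wrapper could look like. The /convert endpoint and the convert_mssql_to_postgres helper are hypothetical names used for this example, not the exact implementation:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="SQL Conversion Service")

class ConversionRequest(BaseModel):
    ms_sql_query: str  # raw MS SQL code submitted by the caller

class ConversionResponse(BaseModel):
    postgres_query: str  # converted PostgreSQL code returned to the caller

def convert_mssql_to_postgres(ms_sql_query: str) -> str:
    # Hypothetical hook into the LangChain conversion service shown later in this post.
    raise NotImplementedError

@app.post("/convert", response_model=ConversionResponse)
def convert(request: ConversionRequest) -> ConversionResponse:
    # Delegate the actual translation to the conversion service.
    return ConversionResponse(postgres_query=convert_mssql_to_postgres(request.ms_sql_query))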

Pipeline Overview

In this section, we’ll delve into the pipeline design to understand how the transformation process works. We'll explore the flow of data from the input of MS SQL files into Python, the conversion to PostgreSQL code using LangChain and custom prompt templates, and the testing and validation conducted within a Docker container. By examining each step, we’ll gain insights into how the pipeline efficiently manages the conversion and addresses any potential errors.

  1. Input: MS SQL files are fed into a Python script.
  2. Conversion: Python utilizes custom prompt templates and LangChain to convert MS SQL files into PostgreSQL-compatible code.
  3. Testing and Validation: A dynamic container is spun up on Docker to test and validate the generated PostgreSQL output files.
  4. Output: The consolidated output includes the converted files and validation results, clearly separating successful conversions from any errors (a minimal driver sketch of this flow follows the list).
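To make the four steps concrete, here's a minimal driver sketch. The convert_file and validate_in_docker callables are hypothetical stand-ins for the conversion and validation services discussed below, and the directory layout is an assumption:

from pathlib import Path
from typing import Callable, Tuple

def run_pipeline(
    input_dir: str,
    output_dir: str,
    convert_file: Callable[[str], str],                     # hypothetical: LangChain conversion step
    validate_in_docker: Callable[[str], Tuple[bool, str]],  # hypothetical: Docker validation step
) -> None:
    """Illustrative end-to-end driver for the four pipeline steps."""
    succeeded, failed = [], []
    for ms_sql_file in sorted(Path(input_dir).glob("*.sql")):            # 1. Input
        postgres_code = convert_file(ms_sql_file.read_text())            # 2. Conversion
        ok, message = validate_in_docker(postgres_code)                  # 3. Testing and validation
        (Path(output_dir) / ms_sql_file.name).write_text(postgres_code)  # 4. Output
        (succeeded if ok else failed).append((ms_sql_file.name, message))
    print(f"Converted {len(succeeded)} file(s); {len(failed)} failed validation.")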

Now, let's review some code snippets, with comments included for clarity.

  • Prompt template: mssql_to_postgres.txt

    You are an experienced software developer specializing in database query translation.
    <INSTRUCTIONS>
    1. Review the provided MSSQL query.
    2. Translate the MSSQL query into PostgreSQL syntax, ensuring it adheres to PostgreSQL standards.
    3. Retain and correctly format any comments present in the input.
    4. Ensure that the converted PostgreSQL query is valid, properly formatted, and executable in PostgreSQL.
    </INSTRUCTIONS>
    <INPUT>
    MSSQL Query: {ms_sql_query}
    </INPUT>
    <OUTPUT>
    </OUTPUT>
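
As a side note, a template file like this can be loaded into a LangChain PromptTemplate before being handed to the service below; a minimal sketch:

from langchain.prompts import PromptTemplate

# Read the template text and declare its single input variable.
with open("mssql_to_postgres.txt") as f:
    template_text = f.read()

prompt_template = PromptTemplate(
    template=template_text,
    input_variables=["ms_sql_query"],  # matches the {ms_sql_query} placeholder above
)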

  • LangChain service: langchain_service.py

from langchain.chains import LLMChain
from langchain_community.chat_models import ChatOpenAI

import constants  # provides API_KEY

# Define model parameters
MODEL_NAME = "gpt-3.5-turbo"
TEMPERATURE = 0.28
TOP_P = 0.9
MAX_TOKENS = 4096
N = 1

def get_prompt_agent(prompt_template):
    """
    Creates and returns a LangChain chain that executes a prompt using a
    specified OpenAI chat model with the parameters defined above.
    """
    # Create an OpenAI chat model instance with the specified parameters
    llm = ChatOpenAI(
        openai_api_key=constants.API_KEY,  # keep the key out of source control
        model=MODEL_NAME,
        temperature=TEMPERATURE,
        model_kwargs={"top_p": TOP_P},
        max_tokens=MAX_TOKENS,
        n=N,
    )

    # Combine the prompt template and the LLM into a runnable chain
    agent = LLMChain(prompt=prompt_template, llm=llm)

    return agent
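
For completeness, here's an illustrative usage of the two snippets together; the sample query is arbitrary:

# Build the agent from the prompt template loaded earlier and convert one query.
agent = get_prompt_agent(prompt_template)

result = agent.invoke({"ms_sql_query": "SELECT TOP 5 * FROM Employee;"})
print(result["text"])  # the converted PostgreSQL query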

Now that we've covered the code snippets, let's take a look at the requirements.txt file. It pins the libraries used for development, so the project builds and runs reproducibly.

# code-related packages
langchain==0.2.14
langchain-community==0.2.12
langchain-core==0.2.34
langchain-text-splitters==0.2.2
openai==1.42.0
psycopg2==2.9.9
python-dotenv==1.0.1
types-psycopg2==2.9.21

# formatting, linting, and static analysis
bandit==1.7.10
black==24.8.0
flake8==7.1.1
isort==5.13.2
mypy==1.11.2

Let's examine the output generated after running the code conversion service.

Input (MS SQL code):

CREATE TABLE Employee (
    EmployeeID INT PRIMARY KEY IDENTITY(1,1),
    FirstName NVARCHAR(50) NOT NULL,
    LastName NVARCHAR(50) NOT NULL,
    Email NVARCHAR(100) UNIQUE NOT NULL,
    DepartmentID INT,
    CreatedAt DATETIME DEFAULT GETDATE()
);

Output (PostgreSQL code):

CREATE TABLE Employee (
  EmployeeID SERIAL PRIMARY KEY,
  FirstName VARCHAR(50) NOT NULL,
  LastName VARCHAR(50) NOT NULL,
  Email VARCHAR(100) UNIQUE NOT NULL,
  DepartmentID INT,
  CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Input (MS SQL code):

INSERT INTO Employee (FirstName, LastName, Email, DepartmentID)
VALUES
('John', 'Doe', 'john.doe@example.com', 1),
('Jane', 'Smith', 'jane.smith@example.com', 2),
('Emily', 'Johnson', 'emily.johnson@example.com', 3),
('Michael', 'Brown', 'michael.brown@example.com', 1);

Output (PostgreSQL code):

INSERT INTO Employee (FirstName, LastName, Email, DepartmentID)
VALUES
('John', 'Doe', 'john.doe@example.com', 1),
('Jane', 'Smith', 'jane.smith@example.com', 2),
('Emily', 'Johnson', 'emily.johnson@example.com', 3),
('Michael', 'Brown', 'michael.brown@example.com', 1);

Input (MS SQL code):

SELECT TOP 5 * FROM Employee;

Output (PostgreSQL code):

SELECT * FROM Employee LIMIT 5;

Input (MS SQL code):

SELECT FirstName + ' ' + LastName AS FullName FROM Employee;

Output (PostgreSQL code):

SELECT FirstName || ' ' || LastName AS FullName FROM Employee;

Input (MS SQL code):

SELECT DATEDIFF(year, HireDate, GETDATE()) AS YearsEmployed FROM Employee;

Output (PostgreSQL code):

SELECT EXTRACT(year FROM AGE(HireDate)) AS YearsEmployed FROM Employee;

Validation result:

Query failed: column "hiredate" does not exist
LINE 2: SELECT EXTRACT(year FROM AGE(HireDate)) AS YearsEmployed FRO...

The last query failed with the error "column "hiredate" does not exist" because the Employee table created above has no HireDate column. This highlights how the validation process identifies missing columns (and similar errors) during a dry run, ensuring problems are caught before the converted code ever reaches a real database.
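For illustration, a dry-run validation along these lines could be implemented with psycopg2 (already in requirements.txt) against a disposable PostgreSQL container. The connection settings and the rollback-based dry run are assumptions about the setup, not the exact implementation:

import psycopg2

def validate_query(postgres_sql: str) -> tuple[bool, str]:
    """Dry-run a converted query against a throwaway PostgreSQL container.

    Assumes a container started with something like:
      docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=postgres postgres:16
    (image tag and credentials are illustrative).
    """
    conn = psycopg2.connect(
        host="localhost", port=5432,
        dbname="postgres", user="postgres", password="postgres",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(postgres_sql)  # syntax and semantic errors surface here
        conn.rollback()                # dry run: never commit the test execution
        return True, "ok"
    except psycopg2.Error as exc:
        conn.rollback()
        return False, f"Query failed: {exc.pgerror or exc}"
    finally:
        conn.close()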

Conclusion

Finally, combining LangChain, Docker, and Python to migrate databases from MS SQL to PostgreSQL not only speeds up the work but also takes advantage of modern tooling for greater efficiency and reliability. By leveraging LangChain's prompt engineering, developers can automate complex transformations and help ensure data integrity throughout the migration. This approach reduces errors and makes it easier to handle the data types and syntax that differ between the two systems.
