PhilipMat

TIL: Jetbrains DataGrip -- Automatically start VPN connection

A good security approach when working with Azure-hosted databases is to connect through a VPN tunnel.

This typically involves installing the Azure VPN Client and importing a profile file, and then starting that VPN tunnel before connecting to the database.

On macOS the above creates a system-wide VPN profile, which means it can be started from the command line with:

scutil --nc start "vpn profile name"

This in turn means that DataGrip can start it automatically before opening a connection to the database.

The steps are as follows:

  1. Locate the name of the VPN profile in macOS System Settings -> VPN.
    Mine is prod-shared-vnet.
  2. After selecting the desired data source, open the Options panel.
  3. Find the Before connection section, click + to Add New Configuration and select Run External Tool.
  4. Create a new tool with a descriptive name like “vpn connect”, then enter:
    1. Program: scutil
    2. Arguments: --nc start prod-shared-vnet (substituting your own VPN profile name).
  5. Make sure the “vpn connect” external tool has been selected before returning to the Before connection section.

Note:

  • executing the first query too quickly, before the tunnel has fully connected, may fail with a “Cannot find server” or similar error; retrying works reliably.
  • for other connections to the same server, just select the existing tool.
  • scutil --nc stop prod-shared-vnet disconnects the VPN tunnel. The tunnel also seems to drop when the computer goes to sleep.
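The “Cannot find server” race in the first note can be avoided by pointing the external tool at a small wrapper script instead of scutil directly. This is a sketch, assuming the prod-shared-vnet profile; the first line of scutil --nc status reports the connection state:

```shell
#!/bin/bash
# Start the VPN profile and wait (up to ~15 s) for it to report Connected.
PROFILE="prod-shared-vnet"

scutil --nc start "$PROFILE"

for _ in $(seq 1 15); do
  # The first line of `scutil --nc status` is the connection state.
  state="$(scutil --nc status "$PROFILE" | head -n 1)"
  [ "$state" = "Connected" ] && exit 0
  sleep 1
done

echo "VPN '$PROFILE' did not reach Connected state in time" >&2
exit 1
```

With this, the first query only runs once the tunnel is actually up (or the tool fails visibly instead of the query).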

Screenshots below.

DataGrip - VPN setup step - Before Connection + Add New Configuration
DataGrip - VPN setup - Create External Tool
DataGrip - VPN setup - Select External Tool
DataGrip - VPN setup step - Before Connection + Existing Tool

TIL: VCR.py -- records and replays HTTP calls for testing

Summary

VCR.py is a Python library inspired by Ruby’s VCR, which simplifies and accelerates testing that involves HTTP requests. VCR.py records HTTP interactions the first time they occur and saves them in a ‘cassette’ file. On subsequent test runs, it replays these interactions, eliminating the need for actual HTTP traffic. This approach allows for offline testing, ensures deterministic test results, and speeds up test execution. If the API changes, simply delete the cassette files, and VCR.py will record new interactions, keeping tests up-to-date.

pytest-recording plugs into pytest to create reusable API-calling tests.

scotch and Betamax.NET are .NET alternatives to VCR.py; neither has been updated in a few years.
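As an illustration (not from the source), a typical VCR.py test might look like this; the endpoint and cassette path are made up, and the requests library is assumed:

```python
import requests
import vcr

# First run: performs the real HTTP request and records it into the cassette.
# Subsequent runs: replays the recorded response; no network traffic happens.
@vcr.use_cassette("fixtures/cassettes/get_user.yaml")
def test_get_user():
    resp = requests.get("https://api.example.com/users/1")
    assert resp.status_code == 200
```

Deleting fixtures/cassettes/get_user.yaml forces a fresh recording on the next run.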

Source: VCR.py – records and replays HTTP calls for testing

Automating TIL Posts from GitHub Issues

I read a lot of things on the internet and rarely write about them. The friction of opening an editor, writing front matter, crafting a summary, and opening a PR is just high enough that it doesn’t happen. So I set up a workflow to do most of it for me.

The idea: create a GitHub issue with a link (or just some notes), and a GitHub Action kicks off, fetches the article, asks an LLM to summarize it and generate tags, then opens a PR with a ready-to-publish TIL post. All I have to do is review and merge.

How It Works

The workflow lives in .github/workflows/til.yml and triggers on any issue I open on the repository (it checks the issue author against the repo owner, so it ignores issues from anyone else).

When it fires, a Python script (.github/scripts/create_til.py) does the following:

  1. Fetches the issue via the GitHub API – title and body.
  2. Extracts a URL from the body if one is present. If found, trafilatura fetches the page and strips it down to the main article text, dropping nav, ads, and boilerplate.
  3. Calls OpenRouter (GPT-4o) with the article content and any notes I added to the issue body. The LLM returns a JSON object with a title, URL slug, 1-2 paragraph summary, a one-line snippet, and a set of tags.
  4. Assembles the post – front matter plus body – and writes it to _posts/YYYY-MM-DD-{slug}.md.
  5. Opens a PR and posts a comment on the original issue with a link to it.

If there’s no URL in the issue, the body text is used directly as the post content and the LLM only generates the metadata (title, slug, tags).

The Post Format

Each generated post gets the standard layout: post front matter plus a few extra fields:

---
layout: post
title: "TIL: How Vector Databases Work"
tags: [databases, ai, embeddings]
source_url: https://example.com/the-article
snippet: A look at approximate nearest neighbor search in vector databases.
---

The snippet field is already used by the archive page template, so TIL posts show up with descriptions in the archive without any template changes. The tags are stored for future use – the site doesn’t render them yet.
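Rendering that front matter is mostly string formatting; a minimal sketch with a made-up helper name, producing the fields shown above:

```python
def front_matter(title: str, tags: list[str], source_url: str, snippet: str) -> str:
    """Render the Jekyll front matter block used by generated TIL posts."""
    tag_list = ", ".join(tags)
    return (
        "---\n"
        "layout: post\n"
        f'title: "{title}"\n'
        f"tags: [{tag_list}]\n"
        f"source_url: {source_url}\n"
        f"snippet: {snippet}\n"
        "---\n"
    )
```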

What I Had to Set Up

One secret in the repo: OPENROUTER_API_KEY. That’s it. The GITHUB_TOKEN is provided automatically by Actions.

The full plan for this workflow is in plan-opus.md in the repo root, including edge case handling, prompt design, and cost estimates.

This work was done using Claude Opus for planning, Codex for implementation (with an alternative by Kimi), and Sonnet for review.

TIL: Building a self-updating profile README for GitHub

Summary

Today I learned about a new feature on GitHub that allows you to create a profile README by setting up a repository with the same name as your GitHub account.

  1. Create a repo with the same name as your account: github.com/foobar/foobar;
  2. Add a README.md to it; and
  3. GitHub will render the contents at the top of your personal profile page.

Bonus: automate updates with GH Actions
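The self-updating part usually boils down to a scheduled Action running a script that rewrites a marked region of README.md and commits the result. A sketch of the rewrite step, with made-up marker comments:

```python
import re

START, END = "<!-- latest:start -->", "<!-- latest:end -->"

def update_section(readme: str, new_content: str) -> str:
    """Replace whatever sits between the marker comments with new_content."""
    pattern = re.compile(f"{re.escape(START)}.*?{re.escape(END)}", re.DOTALL)
    return pattern.sub(f"{START}\n{new_content}\n{END}", readme)
```

The Action then only needs to commit README.md when the rewritten content differs from what's already there.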

Source: Building a self-updating profile README for GitHub

Connect From PyODBC to SQL Server in DataBricks Cluster

It’s a shame that clusters created in Azure DataBricks don’t have the Microsoft ODBC driver for SQL Server, which pyodbc needs, installed by default.
Perhaps the people who normally use DataBricks don’t typically connect to SQL Server, but it feels like it would be easier if the images provided by Microsoft enabled it by default.

Then, to make things more difficult, a lot of the answers on the Internet on how to install these drivers are for older versions of DataBricks.
The newer versions change the installation approach slightly.

The gist of the install process is as follows:

  1. Create a shell script that installs the msodbcsql17 package.
  2. Save it as a file in DataBricks, whether as a Workspace script or in a connected repository.
  3. When configuring the cluster, add this file as an “init script”.

Install Script

The script below is centered around installing the msodbcsql17 driver.
For this, we need to add Microsoft’s package repository to apt and then install the package along with unixodbc-dev:

#!/bin/bash
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install -y msodbcsql17
apt-get install -y unixodbc-dev
apt-get install -y python3-pip

Saving Script in DataBricks

Save the script as a file, say pyodbc-setup.sh, either in a DataBricks repository:

PyODBC Setup File in Connected Repo

or directly in a Workspace file (in the example below, stored in Workspace/Shared/pyodbc-setup.sh):

PyODBC Setup File in Workspace Shared Folder

Configure DataBricks Cluster

Either during creation, or after, configure the DataBricks cluster by expanding the “Advanced Options” section on the “Configuration” tab, then by selecting the “Init Scripts” tab.
There will be three options to add a new script: “Workspace”, “ABFSS”, and “DBFS”.

“DBFS” is deprecated, and “ABFSS” (Azure Blob File System, i.e. Azure Data Lake Storage Gen2) is a bit more complicated to set up; the “Workspace” approach outlined above is the simplest.

The path to the pyodbc-setup.sh script is relative to the root of “Workspace”, so /Shared/pyodbc-setup.sh, or /Repos/user/repo/pyodbc-setup.sh if it lives in a repository.

Setting up init scripts in DataBricks cluster config

Connecting

This whole setup allows pyodbc to connect to SQL Server using a connection string specifying the SQL Server 17 driver, "DRIVER={ODBC Driver 17 for SQL Server};...", like so:

import pyodbc

server = "example-server.database.windows.net"
db_name = "example-db"
user = "example_user"
pwd = "example password"
conn_string = f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};DATABASE={db_name};UID={user};PWD={pwd}"
conn = pyodbc.connect(conn_string)
with conn:
  conn.execute("select * from table")

As a note, there’s a newer version, msodbcsql18 – see here the whole list.
The script remains the same, apart from pointing at a different Ubuntu release, for example: https://packages.microsoft.com/config/ubuntu/18.04/prod.list.
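For msodbcsql18, the init script changes only in the sources list and the package name; a sketch for an Ubuntu 18.04-based runtime (verify which Ubuntu release your cluster image is actually based on):

```shell
#!/bin/bash
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/18.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install -y msodbcsql18 unixodbc-dev
```

Note that driver 18 is named “ODBC Driver 18 for SQL Server” in connection strings, and it turns encryption on by default.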