Connect From PyODBC to SQL Server in DataBricks Cluster

Sun, Sep 10 2023

It’s a shame that clusters created in Azure DataBricks don’t have the pyodbc drivers for Microsoft SQL Server installed by default.
Perhaps the people who normally use DataBricks don’t typically connect to SQL Server, but it feels that it’d be easier if the images provided by Microsoft would enable that by default.

Then, to make things more difficult, a lot of the answers on the Internet on how to install these drivers are for the older versions of DataBricks.
The newer versions change slighly the approach used to install them.

The gist of the install process is as following:

Create an install shell-script to install the msodbcsql17 package
Save it as a file in DataBricks, whether as a Workspace script or in a connected repository.
When configuring the cluster, add this file as an “init script”.

Install Script

The script below is centered around installing the msodbcsql17 driver.
For this, we need to import the Ubuntu sources for apt and then install the package along with the unixodbc-dev:

curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get install msodbcsql17
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y

Saving Script in DataBricks

Save the script as a file, say pyodb-setup.sh, in either a DataBricks repository:

PyODBC Setup File in Connected Repo

or directly in a Workspace file (in the example below, stored in Workspace/Shared/pyodbc-setup.sh):

PyODBC Setup File in Workspace Shared Folder

Configure DataBricks Cluster

Either during creation, or after, configure the DataBricks cluster by expanding the “Advanced Options” section on the “Configuration” tab, then by selecting the “Init Scripts” tab.
There will be three options to add a new script: “Workspace”, “ABFSS”, and “DBFS”.

The “DBFS” is deprecated, and the “ABFSS” (Azure Blob File System or Azure Data Lake Storage Gen) is a bit more complicated to set up; the “Workspace” approach outlined above is the simplest.

The path to the pyodbc-setup.sh script is relative to the root of “Workspace”, so /Shared/pyodb-init.sh or /Repos/user/repo/pyodb-install.sh if in the repository.

Setting up init scripts in DataBricks cluster config

Connecting

This whole setup allows pyodbc to connect to a SQL Server using a connection string specifying the SQL Server 17 driver, "DRIVER=ODBC Driver 17 for SQL Server;...", like so:

server = "example-server.database.windows.net"
db_name = "example-db"
user = "example_user"
pwd = "example password"
conn_string = f"DRIVER=;SERVER={server};DATABASE={db_name};UID={user};PWD={pwd}"
conn = pyodbc.connect(conn_string)
with conn:
  conn.execute("select * from table")

As a note, there’s a newer version, msodbcsql18 – see here the whole list.
The script remains the same, save for pointing to a different Ubuntu, for example: https://packages.microsoft.com/config/ubuntu/18.04/prod.list.

Cached Claims when Using Windows Authentication in ASP.NET Core

Tue, Jan 26 2021

In Loading Claims when Using Windows Authentication in ASP.NET Core we examined an approach for injecting Claims into the ClaimsPrincipal in order to enable policy usage – [Authorization(Policy = "SomePolicy")] – on controller actions.

One of the purposes of the IClaimsTransformation implementation is to provide an easier, and somewhat efficient, way to use authorization policies. As such, we wouldn’t be wrong to perform some expensive operations in the class implementing this interface. For example, querying a database.

Having than happen on every request is a bit more than annoying while in development.

While we cannot avoid the calls to the claims transformer, we can avoid the expensive calls by using a caching approach.
The title is misleading a bit at this point. We will be caching the expensive calls and not the claims.

In the Windows claims example, we have MagicPowersInfoProvider as a way to provide information to the claims transformer, MyClaimsLoader, which in turn determines whether a claim needs to be added to the ClaimsIdentity (in TransformAsync).

MagicPowersInfoProvider is registered as a singleton, which makes is a good place to handle caching.

services.AddSingleton<Auth.MagicPowersInfoProvider>();

It only makes sense to cache when running under IIS Express.
Luckily, we don’t need to perform any complex detection of IIS Express. We just need to modify the launchSettings.json file to add an environment variable:

{
  "profiles": {
    "IIS Express": {
      "commandName": "IISExpress",
      "launchBrowser": true,
      "environmentVariables": {
        "ASPNETCORE_ENVIRONMENT": "Development",
        "CacheClaims": "true"
      }
    },
  ...

Then MagicPowersInfoProvider can make use of an injected IMemoryCache when the "CacheClaims" key is true, which would only be when running the application from Visual Studio and under IIS Express.

public class MagicPowersInfoProvider
{
    private const string CacheClaimsKey = "CacheClaims";
    private const int ClaimCacheInSeconds = 5 * 60;
    private readonly bool _cacheClaims;
    private readonly IMemoryCache _memoryCache;

    public MagicPowersInfoProvider(IConfiguration config, IMemoryCache memoryCache)
    {
        _memoryCache = memoryCache;
        _cacheClaims = config.GetValue<bool>(CacheClaimsKey);
    }

    public async Task<bool> CanHasPowerAsync(string userId)
    {
        if (!_cacheClaims)
        {
            return await ExpensiveHasPowerOperation(userId);
        }

        return await _memoryCache.GetOrCreateAsync<bool>(
            $"power-{userId}",
            async cacheEntry =>
            {
                cacheEntry.SlidingExpiration = TimeSpan.FromSeconds(ClaimCacheInSeconds);
                bool hasPower = await ExpensiveHasPowerOperation(userId);
                return hasPower;
            });
    }

    private Task<bool> ExpensiveHasPowerOperation(string userId)
        => Task.FromResult(true);
}

For a full example, containing all the code, see this repo.

Replicating macOS's say command in Windows

Fri, Dec 07 2018

Is say, macOS has a wonderful command line utility which I found to be useful to use in conjuction with long-running processes or even debugging to help draw attention, more so that typical beeping would do.

In short, say speaks text - is a Text-to-Speech (TTS) program.

say "Hello, there"

I wanted something similar on Windows and while there’s no direct equivalent, luckily .NET provides an entire host of utilities through the System.Speech.Synthesis namespace.

The say command has a number of parameter, mostly dealing with technical attributes such as voice selection (the speaker), output of spoken text, quality, e.t.c.

For this example, we’ll stick with the default voice of the speech synthesizer. As such, the solution is really simple using a Powershell script:

[cmdletbinding()]
param(
    [Parameter(Position = 1, Mandatory = $true)]
    [String]
    $message
)
Add-Type -AssemblyName System.Speech
$synth = New-Object -TypeName System.Speech.Synthesis.SpeechSynthesizer
$synth.Speak($message)

The Position-al parameter binding allow us to either call it directly:

say.ps1 'Hello there'

Or pass is with a switch argument:

say.ps1 -message 'Hello there'

I wish Visual Studio still had the ability to call macros on breakpoint because the code could translate into a one-liner in C#:

new System.Speech.Synthesis.SpeechSynthesizer()
  .Speak("Breakpoint hit");

As such, one would have to wrap it first into a method that can then get called when a breakpoint is hit.

Debug action with speech synthesis

It would be interesting to replicate the rest of the commands, in particular the voices since that would also allow for proper I18N speech synthesis.

Enumerating Directly to a FileResult

Sun, Nov 25 2018

ASP.NET Core controllers have a series of file method that deal with returning files to the browser.

The methods can be grouped in three categories, all returning a FileResult:

returning a file specified by path: File(String, ...) and friends;
returning an array of bytes making up a file contents: File(byte[], ...), etc;
returning a file whose content is read from a stream: File(Stream, ...), etc.

As of v2.1 there’s over 20 File(...) methods supporting the various scenarios of these 3 categories.

What is missing, likely because it doesn’t fit with the simplicity of types used in the other signatures, is the ability to write an enumeration directly to the output stream and do so with minimum memory usage.

An good example would be to serve the results of a query as CSV.

There are certain ways to work with the existing methods, for example we could write the enumeration to a file and then use on of the File(string, ...) methods to serve the file.
Another approach would be to write it into a stream, in memory (MemoryStream), re-setting the stream position to 0 and using the File(Stream, ...) signatures.

Both of those approaches are inefficient in that they either perform unneeded I/O or use unnecessary memory.

In either of these cases, the more efficient approach would be enumerating/iterating through the records rather than materializing the entire dataset, and writing them out to the response stream one by one.

To do so we’ll construct our own implementation of the abstract FileResult class. The source of inspiration is the fact that each of the File methods mentioned above returns their own type of FileResult: FileContentResult, FileStreamResult, etc.

Our implementation, EnumerableFileResult will accept an IEnumerable and will write its elements one by one to the Response.Body stream (System.IO.Stream).

However EnumerableFileResult would not know how to write the elements to the stream, so it will delegate that resposibility to an adapter class, one implementing an proposed IStreamWritingAdapter interface.

To make the this whole implementation even more flexible, we’ll consider allowing the adapter to write a header, for example the column names, and since we’re there allow it to write a footer too (maybe the total record count?).

The IStreamWritingAdapter looks like the following:

public interface IStreamWritingAdapter<T>
{
    string ContentType { get; }

    Task WriteHeaderAsync(Stream stream);

    Task WriteAsync(T item, Stream stream);

    Task WriteFooterAsync(Stream stream, int recordCount);
}

The ContentType ties the adapter to the file type, after all they’re in synca, and allows the adapter to inform the FileResult parent of the MIME content-type of the content dispatched to the caller.

To recap:

EnumerableFileResult<T> inherits from FileResult;
- Accepts an IEnumerable<T>;
- Uses an IStreamWritingAdapter<T> to write each element of the enumeration to a Stream.

class EnumerableFileResult<T> : FileResult
{
    private readonly IEnumerable<T> _enumeration;
    private readonly IStreamWritingAdapter<T> _writer;

    public EnumerableFileResult(
        IEnumerable<T> enumeration,
        IStreamWritingAdapter<T> writer)
        : base(writer.ContentType)
    {
        _enumeration = enumeration ?? throw new ArgumentNullException(nameof(enumeration));
        _writer = writer ?? throw new ArgumentNullException(nameof(writer));
    }

    public override async Task ExecuteResultAsync(ActionContext context)
    {
        SetContentType(context);
        SetContentDispositionHeader(context);

        await WriteContentAsync(context).ConfigureAwait(false);
    }

    private async Task WriteContentAsync(ActionContext context)
    {
        var body = context.HttpContext.Response.Body;
        await _writer.WriteHeaderAsync(body).ConfigureAwait(false);
        int recordCount = 0;
        foreach (var item in _enumeration)
        {
            await _writer.WriteAsync(item, body).ConfigureAwait(false);
            recordCount++;
        }

        await _writer.WriteFooterAsync(body, recordCount);

        await base.ExecuteResultAsync(context).ConfigureAwait(false);
    }
    ...
}

The usage is fairly straighforward; within a controller or a page handler, we return an instance of the EnumerableFileResult initialized with the enumeration and the writer:

public IActionResult OnDownload() {
    IEnumerable<Person> people = GetPeople();
    return new EnumerableFileResult<Person>(
        people,
        new PeopleToCsvWriter()) {
            FileDownloadName = "People.csv"
    };
}

I’ve provided on GitHub a fully functioning example.

Simply clone and run the application and notice that the memory usage of the application while generating 100k or even 1M records is fairly small and constant past the initial load.

Note: a nicer implementation of this would make use of a Visitor Pattern, in which a CsvVisitor would visit any T implementation that accepts such a visitor allowing even further decoupling between the objects being enumerated and the class doing the writing.

Two Approaches to Searching Users in Active Directory

Thu, Oct 18 2018

I’m sure there are more than two ways to perform searches against Active Directory, however I wanted to highlight two approaches: DirectorySearcher and PrincipalSearcher.

The former, DirectorySearcher comes from System.DirectoryServices and it’s the more “bare-metal” version of the two.

PrincipalSearcher, of System.DirectoryServices.AccountManagement provenance, is more of a query by example pattern and I’d say a higher level abstraction of directory searching.

To use DirectorySearcher, namely through it’s Filter property, one requires a bit more advance knowledge (or Googling skills) in order to decipher and employ the LDAP format filter string.

The payoff of using DirectorySearcher is the ability to construct complex query, including compound expressions across various objects: "(&(objectCategory=person)(objectClass=contact)(|(sn=Smith)(sn=Johnson)))" would find all contacts with a surname of Smith or Johnson.

However, for simple queries, the simplicity of PrincipalSearcher makes for easier to read code.

Consider the example of searching for all domain IDs (SAM account name) that begin with “john”:

var domain = "CORP";
var container = "DC=ad,DC=example,DC=com";

using(var context = new PrincipalContext(ContextType.Domain, domain, container)) {
    var principal = new UserPrincipal(context) {
        SamAccountName = "john*"
    };
    using(var searcher = new PrincipalSearcher(principal)) {
        PrincipalSearchResult<Principal> result = searcher.FindAll();
        result.Dump();
    }
}

Contrast with the same code using DirectorySearcher:

var ldapPath = "DC=corp,DC=ad,DC=example,DC=com";

using (var entry = new DirectoryEntry($"LDAP://{ldapPath}"))  {
    using(var searcher = new DirectorySearcher(entry)) {
        searcher.Filter = "(&(objectClass=user)(sAMAccountName=john*))";
        SearchResultCollection result = searcher.FindAll();
        result.Dump();
    }
}

Should we want to find a user with the last name being “Smith”, in the PrincipalSearcher case is as easy as setting the UserPrincipal’s Surname property - easily discoverable, whereas for the DirectorySearcher one would have to research and find out that the property is called, a bit more cryptical, sn.

What was also interesting to me is that perhaps owing to PrincipalSearcher formulating better criteria that I could, DirectorySearcher seems to be about 1.5-2x slower that the Principal version: whereas the former returns, in my attempts, in about 500ms, the directory searcher version takes 800-1,100ms for the same operation.

The type returned by the two methods is also another factor worth considering.

The SearchResult returned by the directory searcher method is sparse and all interaction is to be done through its Properties property, which is an implementation of System.Collections.DictionaryBase.
These properties are really LDAP properties and to get information out of a search result one needs to know what these properties represent – for example, knowing that “c” represent “country”, or “sn” is “surname”, or “cn” is “common name”.

In contrast, the UserPrincipal class offered by the PrincipalSearchResult<T> has more straighforward properties: Surname, GivenName, etc, although it might not have some of the properties stored in LDAP, for example the afore mentioned c = countryName.

Due to its more straightforward nature, I will be personally employing PrincipalSearcher for simple search queries and hope that I would never have to land in a case where I require the full power of the DirectorySearcher.

However, if I do - I now know what to search for.