Data Management With Python - 1

26/1/2023 3-minute read

SQLite and SQLAlchemy in Python: Moving Your Data Beyond Flat Files

All programs process data in one form or another, and many need to be able to save and retrieve that data from one invocation to the next. Python, SQLite, and SQLAlchemy give your programs database functionality, allowing you to store data in a single file without the need for a database server.

  • Flat files for data storage

  • SQL to improve access to persistent data

  • SQLite for data storage

  • SQLAlchemy to work with data as Python objects

Flat Files for Data Storage:

Your software stores data in memory through the use of variables, but like life, this is fleeting: exit the program or turn off your computer, and the memory is gone.

This can be addressed through saving things out to some sort of storage, usually your disk drive. A side benefit of this is if it’s done well, it can be used to interact with more data than can be held in memory at any one time.

Flat file storage is a generic term for text-based files that typically can be read by a person. There are many different formats. Some common ones are CSV,JSON, and the granddaddy of angle brackets, XML.

Let’s go look at a program that uses Python’s csv module to see the kinds of things that can be done with a flat file format.

Sample CSV Data: can be downloaded from

Code to read the csv file and generate some summary information:

# flat_counts.py
from collections import defaultdict
import csv

authors = defaultdict(int)
publishers = defaultdict(int)

with open('path_to_the_csv_file') as csvfile:
  reader = csv.DictReader(csvfile)
  for row in reader:
  
    name = f"{row['first_name']} {row['last_name']}"
    
    authors[name] += 1
    publishers[row["publisher"]] += 1

print("Authors:")

for name, count in authors.items():
  print(f" {name}:{count}")

print("Publishers:")

for publisher, count in publishers.items():
  print(f" {publisher}: {count}")

The defaultdict is created with an integer, which means you can assume that a key exists, and if it didn’t, the defaultdict will create one automatically for you in the form of an int. I loop through each row of data, adding the author’s name and publisher’s name into the respective dictionaries to form a count of occurrences. After that, I print out a summary of how many instances of each author name and of each publisher.

A count for each author and a count for each publisher.

CSV has some limitations. For example, data is repeated. There are four books by Stephen King, and each get a row. That means Stephen’s name shows up four separate times.

This is because everything is from the rows’ perspective, which in this case is the book. If I want to add the age of an author, technically I can do it, but I’d have to store it on each of the four instances of King’s books.

If I need to make a change, I’d have to do that in all four places. Essentially, any data associated with the book is just that—it’s an attribute of the book. I can’t really interrelate different items easily.

For example, Richard Bachman is an alias of Stephen King’s, or detecting It has one author but two publishers—it’s messy. To detect any of these things, you’d have to write some custom code.

You probably know where this is leading. Relational databases to the rescue!

Which we will discuss in the next post.

Posts in this Series