PostgreSQL vs Python for Data Cleaning: A Guide

PostgreSQL vs Python for Data Cleaning: A Guide

Introduction

During analysis, you rarely—if ever—get to go directly from evaluating data to transforming and analyzing it. Sometimes to properly evaluate your data, you may need to do some pre-cleaning before you get to the main data cleaning, and that’s a lot of cleaning! In order to accomplish all this work, you may use Excel, R, or Python, but are these the best tools for data cleaning tasks?

In this blog post, I explore some classic data cleaning scenarios and show how you can perform them directly within your database using TimescaleDB and PostgreSQL, replacing the tasks that you may have done in Excel, R, or Python. TimescaleDB and PostgreSQL cannot replace these tools entirely, but they can help your data munging/cleaning tasks be more efficient and, in turn, let Excel, R, and Python shine where they do best: in visualizations, modeling, and machine learning.  

Cleaning is a very important part of the analysis process and generally can be the most grueling from my experience! By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks one time rather than repetitively within a script, saving me considerable time in the long run.

A recap of the data analysis process

I began this series of posts on data analysis by presenting the following summary of the analysis process:

Image showing Evaluate -> Clean -> Transform -> Model, accompanied by icons which relate to each step
Data Analysis Lifecycle

The first three steps of the analysis lifecycle (evaluate, clean, transform) comprise the “data munging” stages of analysis. Historically, I have done my data munging and modeling all within Python or R, these being excellent options for analysis. However, once I was introduced to PostgreSQL and TimescaleDB, I found how efficient and fast it was to do my data munging directly within my database. In my previous post, I focused on showing data evaluation techniques and how you can replace tasks previously done in Python with PostgreSQL and TimescaleDB code. I now want to move on to the second step, data cleaning. Cleaning may not be the most glamorous step in the analysis process, but it is absolutely crucial to creating accurate and meaningful models.

As I mentioned in my last post, my first job out of college was at an energy and sustainability solutions company that focused on monitoring all different kinds of utility usage - such as electricity, water, sewage, you name it - to figure out how our clients’ buildings could be more efficient. My role at this company was to perform data analysis and business intelligence tasks.

Throughout my time in this job, I got the chance to use many popular data analysis tools including Excel, R, and Python. But once I tried using a database to perform my data munging tasks - specifically PostgreSQL and TimescaleDB - I realized how efficient and straightforward analysis, and particularly cleaning tasks, could be when done directly in a database.

Before using a database for data cleaning tasks, I would often find either columns or values that needed to be edited. I would pull the raw data from a CSV file or database, then make any adjustments to this data within my Python script. This meant that every time I ran my Python script, I would have to wait for my machine to spend computational time setting up and cleaning my data. This means that I lost time with every run of the script. Additionally, if I wanted to share cleaned data with colleagues, I would have to run the script or pass it along to them to run. This extra computational time could add up depending on the project.

Instead, with PostgreSQL, I can write a query to do this cleaning once and then store the results in a table. I wouldn’t need to spend time cleaning and transforming data again and again with a Python script, I could just set up the cleaning process in my database and call it a day! Once I started to make cleaning changes directly within my database, I was able to skip performing cleaning tasks within Python and simply focus on jumping straight into modeling my data.

To keep this post as succinct as possible, I chose to only show side-by-side code comparisons for Python and PostgreSQL. If you have any questions about other tools or languages, please feel free to join our Slack channel, where you can ask the Timescale community, or me, specific questions about Timescale or PostgreSQL functionality 😊. I’d love to hear from you!

Additionally, as we explore TimescaleDB and PostgreSQL functionality together, you may be eager to try things out right away! Which is awesome! The easiest way to get started is by signing up for a free 30-day trial of Timescale (if you prefer self-hosting, you can always install and manage TimescaleDB on your own PostgreSQL instances). Learn more by following one of our many tutorials.

Now, before we dip into things and get our data, as Outkast best put it, “So fresh, So clean”, I want to quickly cover the data set I will be using. In addition, I also want to note that all the code I show will assume you have some basic knowledge of SQL. If you are not familiar with SQL, don’t worry! In my last post, I included a section on SQL basics which you can find here.

About the sample dataset

In my experience within the data science realm, I have done the majority of my data cleaning after evaluation. However, sometimes it can be beneficial to clean data, evaluate, and then clean again. The process you choose is dependent on the initial state of your data and how easy it is to evaluate. For the data set I will use today, I would likely do some initial cleaning before evaluation and then clean again after, and I will show you why.

I got the following IoT data set from Kaggle, where a very generous individual shared their energy consumption readings from their apartment in San Jose CA, this data incrementing every 15 minutes. While this is awesome data, it is structured a little differently than I would like. The raw data set follows this schema:

Graphic showing the setup of the table. The tables name is 'energy_usage_staging'. each row contains the tables column and data types, the pairs of info are as follows ([type, text], [date, date], [start_time, time], [end_time, time], [usage, float4], [units,text], [cost, text], [notes, text])

and appears like this…

type date start_time end_time usage units cost notes
Electric usage 2016-10-22 00:00:00 00:14:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:15:00 00:29:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:30:00 00:44:00 0.01 kWh $0.00
Electric usage 2016-10-22 00:45:00 00:59:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:00:00 01:14:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:15:00 01:29:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:30:00 01:44:00 0.01 kWh $0.00
Electric usage 2016-10-22 01:45:00 01:59:00 0.01 kWh $0.00


In order to do any type of analysis on this data set, I want to clean it up. A few things that quickly come to mind include:

  • The cost is seen as a text data type which will cause some issues.
  • The time columns are split apart which could cause some problems if I want to create plots over time or perform any type of modeling based on time.
  • I may also want to filter the data based on various parameters that have to do with time, such as day of the week or holiday identification (both potentially play into how energy is used within the household).

In order to fix all of these things and get more valuable data evaluation and analysis, I will have to clean the incoming data! So without further ado, let’s roll up our sleeves and dig in!

Cleaning the data

I will show most of the techniques I have used in the past while working in data science. While these examples are not exhaustive, I hope they will cover many of the cleaning steps you perform during your own analysis, helping to make your cleaning tasks more efficient by using PostgreSQL and TimescaleDB.

Please feel free to explore these various techniques and skip around if you need! There is a lot here, and I designed it to be a helpful glossary of tools that you could use as you need.

The techniques that I will cover include:

Note on cleaning approach:

There are many ways that I could approach the cleaning process in PostgreSQL. I could create a table then ALTER it as I clean, I could create multiple tables as I add or change data, or I could work with VIEWs. Depending on the size of my data, any of these approaches could make sense, however, they will have different computational consequences.

You may have noticed above that my raw data table was called energy_usage_staging. This is because I decided that given the state of my raw data, it would be best for me to place the raw data in a staging table, clean it using VIEWs, then insert it into a more usable table as part of my cleaning process. This move from raw table to the usable table could happen even before the evaluation step of analysis. As I discussed above, sometimes data cleaning has to occur after AND before evaluating your data. Regardless, this data needs to be cleaned and I wanted to use the most efficient method possible. In this case, that meant using a staging table and leveraging the efficiency and power of PostgreSQL VIEWs, something I will talk about later.

Generally, if you are dealing with a lot of data, altering an existing table in PostgreSQL can be costly. For this post, I will show you how to build up clean data using VIEWs along with additional tables. This method of cleaning is more efficient and sets you up for the next blog post about data transformation which includes the use of scripts in PostgreSQL.

Correcting structural issues

Right off the bat, I know that I need to do some data refactoring on my raw table due to data types. Notice that we have date and time columns separated and costs is recorded as a text data type. I need to convert my separated date time columns to a timestamp and the cost column to float4. But before I show that, I want to talk about why conversion to timestamp is beneficial.

TimescaleDB hypertables and why timestamp is important

For those of you not familiar with the structure of TimescaleDB hypertables, they are at the basis of how we efficiently query and manipulate time-series data. Timescale hypertables are partitioned based on time, and more specifically by the time column you specify upon creation of the table.

The data is partitioned by timestamp into "chunks" so that every row in the table belongs to some chunk based on a time range. We then use these time chunks to help query the rows so that you can get more efficient querying and data manipulation based on time. This image represents the difference between a normal table and our special hypertables.

Graphic showing a normal table vs a hypertable. The normal table just shows data in a table. The hypertable shows data in the table, but it also shows the data being "grouped" or "chunked" by day. By adding an index like structure based on time, queries can be more efficient.

Changing date-time structure

Because I want to utilize TimescaleDB functionality to the fullest, such as continuous aggregates and faster time based queries, I want to restructure the energy_usage_staging table's date and time columns. I could use the date column for my hypertable partitioning, however, I would have limited control over manipulating my data based on time. It is more flexible and space efficient to have a single column with a timestamp than it is to have separate columns with date and time. I can always extract the date or time from the timestamp if I want to later!  

Looking back at the table structure, I should be able to get a usable timestamp value from the date and start_time columns as the end_time really doesn’t give me that much useful information. Thus, I want to essentially combine these two columns to form a new timestamp column, let’s see how I can do that using SQL. Spoiler alert, it is as simple as an algebraic statement. How cool is that?!

PostgreSQL code:

In PostgreSQL I can create the column without inserting it into the database just yet. Since I want to create a NEW table from this staging one, I don’t want to add more columns or tables just yet.

Let’s first compare the original columns with our new generated column. For this query I simply add the two columns together. The AS keyword just allows me to rename the column to whatever I would like, in this case being time.

--add the date column to the start_time column
SELECT date, start_time, (date + start_time) AS time 
FROM energy_usage_staging eus;

Results:

date start_time time
2016-10-22 00:00:00 2016-10-22 00:00:00.000
2016-10-22 00:15:00 2016-10-22 00:15:00.000
2016-10-22 00:30:00 2016-10-22 00:30:00.000
2016-10-22 00:45:00 2016-10-22 00:45:00.000
2016-10-22 01:00:00 2016-10-22 01:00:00.000
2016-10-22 01:15:00 2016-10-22 01:15:00.000

Python code:

In Python, the easiest way to do this is to add a new column to the dataframe. Notice that in Python I would have to concatenate the two columns along with a defined space, then convert that column to datetime.

energy_stage_df['time'] = pd.to_datetime(energy_stage_df['date'] + ' ' + energy_stage_df['start_time'])
print(energy_stage_df[['date', 'start_time', 'time']])

Changing column data types

Next, I want to change the data type of my cost column from text to float. Again, this is straightforward in PostgreSQL with the TO_NUMBER() function.

The format of the function is as follows: TO_NUMBER(‘text’, ‘format’) . The ‘format’ input is a PostgreSQL specific string that you can build depending on what type of text you want to convert. In our case we have a $ symbol followed by a numeric set up 0.00. For the format string I decided to use ‘L99D99’. The L lets PostgreSQL know there is a money symbol at the beginning of the text, the 9s let the system know I have numeric values, and then the D stands for a decimal point.

I decided to cap the conversion on values that would be less than or equal to ‘$99.99’ because the cost column has no values greater than 0.65. If you were planning to convert a column with larger numeric values, you would want to account for that by adding in a G for commas. For example, say you have a cost column with text values like this ‘$1,672,278.23’ then you would want to format the string like this ‘L9G999G999D99’

PostgreSQL code:

--create a new column called cost_new with the to_number() function
SELECT cost, TO_NUMBER("cost", 'L9G999D99') AS cost_new
FROM energy_usage_staging eus  
ORDER BY cost_new DESC

Results:

cost cost_new
$0.65 0.65
$0.65 0.65
$0.65 0.65
$0.57 0.57
$0.46 0.46
$0.46 0.46
$0.46 0.46
$0.46 0.46

Python code:

For Python, I used a lambda function that systematically replaces all the ‘$’ signs with empty strings. This can be fairly inefficient.

energy_stage_df['cost_new'] = pd.to_numeric(energy_stage_df.cost.apply(lambda x: x.replace('$','')))
print(energy_stage_df[['cost', 'cost_new']])

Creating a VIEW

Now that I know how to convert my columns, I can combine the two queries and create a VIEW of my new restructured table. A VIEW is a PostgreSQL object which allows you to define a query and call it by it’s VIEWs name, as if it were a table within your database. I can use the following query to generate the data I want and then create a VIEW that I can query it as if it were a table.

PostgreSQL code:

-- query the right data that I want
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Results:

type time usage units cost notes
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 01:45:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 02:00:00.000 0.02 kWh 0.00
Electric usage 2016-10-22 02:15:00.000 0.02 kWh 0.00

I decided to call my VIEW energy_view. Now, when I want to do further cleaning, I can just specify its name in the FROM statement.

--create view from the query above
CREATE VIEW energy_view AS
SELECT type, 
(date + start_time) AS time, 
"usage", 
units, 
TO_NUMBER("cost", 'L9G999D99') AS cost, 
notes 
FROM energy_usage_staging

Python code:

energy_df = energy_stage_df[['type','time','usage','units','cost_new','notes']]
energy_df.rename(columns={'cost_new':'cost'}, inplace = True)
print(energy_df.head(20))

It is important to note that with PostgreSQL VIEWs, the data inside of them have to be recalculated every time you query it. This is why we want to insert our VIEW data into a hypertable once we have the data set up just right. You can think of VIEWs as a shorthand version of the CTEs WITH AS statement I discussed in my last post.

We are now one step closer to cleaner data!

Creating or generating relevant data

With some quick investigation, we can see that the notes column is blank for this data set. To check this I just need to include a WHERE clause and specify where notes are not equal to an empty string.

PostgreSQL code:

SELECT * 
FROM energy_view ew
-- where notes are not equal to an empty string
WHERE notes!='';

Results come out empty

Python code:

print(energy_df[energy_df['notes'].notnull()])

Since the notes are blank, I would like to replace the column with various sets of additional information that I could use later on during modeling. One thing I would like to add in particular, is a column that specifies the day of the week. To do this I can use the EXTRACT() command. The EXTRACT() command is a PostgreSQL date/time function that allows you to extract various date/time elements. For our column, PostgreSQL has the specification DOW (day-of-week) which maps 0 to Sunday through to 6 for Saturday.

PostgreSQL code:

--extract day-of-week from date column and cast the output to an int
SELECT *,
EXTRACT(DOW FROM time)::int AS day_of_week
FROM energy_view ew

Results:

type time usage units cost notes day_of_week
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00 6

Python code:

energy_df['day_of_week'] = energy_df['time'].dt.dayofweek

Additionally, we may want to add another column that specifies if a day occurs over a weekend or weekday. I will do this by creating a boolean column, where true represents a weekend, and false represents a weekday. To do this, I will apply a CASE statement. With this command I can specify “when-then” statements (similar to “if-then” statements in coding) where I can say WHEN a day_of_week value is IN the set (0,6) THEN the output should be true, ELSE the value should be false.

PostgreSQL code:

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
--use the case statement to make a column true when records fall on a weekend aka 0 and 6
CASE WHEN (EXTRACT(DOW FROM time)::int) IN (0,6) then true
	ELSE false
END AS is_weekend
FROM energy_view ew

Results:

type time usage units cost day_of_week is_weekend
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6 true
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6 true

Fun fact: you can do the same query without a CASE statement, however it only works for binary columns.

--another method to create a binary column
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend
FROM energy_view ew

Python code:

Notice that in Python, the weekends are represented by numbers 5 and 6 vs the PostgreSQL weekend values 0 and 6.

energy_df['is_weekend'] = np.where(energy_df['day_of_week'].isin([5,6]), 1, 0)
print(energy_df.head(20))

And maybe things then start getting real crazy, maybe you want to add more parameters!

Let’s consider holidays. Now you may be asking “Why in the world would we do that?!”, but often people have time off during some of the holidays within the US. Since this individual lives within the US, they likely have at least some of the holidays off whether they are the day of OR a federal holiday. Where there are days off, there could be a difference in energy usage. To help guide my analysis, I want to include the identification of holidays. To do this, I’m going to create another boolean column that identifies when a federal holiday occurs.

To do this, I am going to use TimescaleDB’s time_bucket() function. The time_bucket() function is one of the functions I discussed in detail within my previous post. Essentially, I need to use this function to make sure all time values within a single day get accounted for. Without using the time_bucket() function, I would only see changes to the row associated with the 12am time period.

PostgreSQL code:

After I create a holiday table, I can then use the data from it within my query. I also decided to use the non-case syntax for this query. Note that you can use either!

--create table for the holidays
CREATE TABLE holidays (
date date)

--insert the holidays into table
INSERT INTO holidays 
VALUES ('2016-11-11'), 
('2016-11-24'), 
('2016-12-24'), 
('2016-12-25'), 
('2016-12-26'), 
('2017-01-01'),  
('2017-01-02'), 
('2017-01-16'), 
('2017-02-20'), 
('2017-05-29'), 
('2017-07-04'), 
('2017-09-04'), 
('2017-10-9'), 
('2017-11-10'), 
('2017-11-23'), 
('2017-11-24'), 
('2017-12-24'), 
('2017-12-25'), 
('2018-01-01'), 
('2018-01-15'), 
('2018-02-19'), 
('2018-05-28'), 
('2018-07-4'), 
('2018-09-03'), 
('2018-10-8')

SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
-- I can then select the data from the holidays table directly within my IN statement
time_bucket('1 day', time) IN (SELECT date FROM holidays) AS is_holiday
FROM energy_view ew

Results:

type time usage units cost day_of_week is_weekend is_holiday
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 01:00:00.000 0.01 kWh 0.00 6 true false
Electric usage 2016-10-22 01:15:00.000 0.01 kWh 0.00 6 true false

Python code:

holidays = ['2016-11-11', '2016-11-24', '2016-12-24', '2016-12-25', '2016-12-26', '2017-01-01',  '2017-01-02', '2017-01-16', '2017-02-20', '2017-05-29', '2017-07-04', '2017-09-04', '2017-10-9', '2017-11-10', '2017-11-23', '2017-11-24', '2017-12-24', '2017-12-25', '2018-01-01', '2018-01-15', '2018-02-19', '2018-05-28', '2018-07-4', '2018-09-03', '2018-10-8']
energy_df['is_holiday'] = np.where(energy_df['day_of_week'].isin(holidays), 1, 0)
print(energy_df.head(20))

At this point, I’m going to save this expanded table into another VIEW so that I can call the data without writing out the query.

PostgreSQL code:

--create another view with the data from our first round of cleaning
CREATE VIEW energy_view_exp AS
SELECT type, time, usage, units, cost,
EXTRACT(DOW FROM time)::int AS day_of_week, 
EXTRACT(DOW FROM time)::int IN (0,6) AS is_weekend,
time_bucket('1 day', time) IN (select date from holidays) AS is_holiday
FROM energy_view ew

You may be asking, “Why did you create these as boolean columns??”, a very fair question! You see, I may want to use these columns for filtering during analysis, something I commonly do during my own analysis process. In PostgreSQL, when you use boolean columns you can filter things super easily. For example, say that I want to use my table query so far and show only the data that occurs over the weekend AND a holiday. I can do this simply by adding in a WHERE statement along with the specified columns.

PostgreSQL code:

--if you use binary columns, then you can filter with a simple WHERE statement
SELECT *
FROM energy_view_exp
WHERE is_weekend AND is_holiday

Results:

type time usage units cost day_of_week is_weekend is_holiday
Electric usage 2016-12-24 00:00:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:15:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:30:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 00:45:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 01:00:00.000 0.34 kWh 0.06 6 true true
Electric usage 2016-12-24 01:15:00.000 0.34 kWh 0.06 6 true true

Python code:

print(energy_df[(energy_df['is_weekend']==1) & (energy_df['is_holiday']==1)].head(10))

Adding data to a hypertable

Now that I have new columns ready to go and I know how I would like my table to be structured, I can create a new hypertable and insert my cleaned data. In my own analysis with this data set, I may have done the cleaning up to this point BEFORE evaluating my data so that I can get a more meaningful evaluation step in analysis. What’s great is that you can use any of these techniques for general cleaning, whether that is before or after evaluation.

PostgreSQL:



CREATE TABLE energy_usage (
type text,
time timestamptz,
usage float4,
units text,
cost float4,
day_of_week int,
is_weekend bool,
is_holiday bool,
) 

--command to create a hypertable
SELECT create_hypertable('energy_usage', 'time')

INSERT INTO energy_usage 
SELECT *
FROM energy_view_exp

Note that if you had data continually coming in you could create a script within your database that automatically makes these changes when importing your data. That way you can have cleaned data ready to go in your database rather than processing and cleaning the data in your scripts every time you want to perform analysis.

We will discuss this in detail in my next post, so make sure to stay tuned in if you want to know how to create scripts and keep data automatically updated!

Renaming values

Another valuable technique for cleaning data is being able to rename various items or remap categorical values. The importance of this skill is amplified by the popularity of this Python data analysis question on StackOverflow. The question states “How do I change a single index value in a pandas dataframe?”. Since PostgreSQL and TimescaleDB use relational table structures, renaming unique values can be fairly simple.

When renaming specific index values within a table, you can do this “on the fly” by using PostgreSQL’s CASE statement within the SELECT query. Let’s say I don’t like Sunday being represented by a 0 in the day_of_week column, but would prefer it to be a 7. I can do this with the following query.

PostgreSQL code:

SELECT type, time, usage, cost, is_weekend,
-- you can use case to recode column values 
CASE WHEN day_of_week = 0 THEN 7
ELSE day_of_week 
END
FROM energy_usage

Python code:

Caveat, this code would make Monday = 7 because the python DOW function has Monday set to 0 and Sunday set to 6. But this is how you would update one value within a column. Likely you would not want to do this exact action, I just wanted to show the python equivalent for reference.

energy_df.day_of_week[energy_df['day_of_week']==0] = 7
print(energy_df.head(250))

Now, let’s say that I wanted to actually use the names of the days of the week instead of showing numeric values? For this example, I actually want to ditch the CASE statement and create a mapping table. When you need to change various values, it will likely be more efficient to create a mapping table and then join to this table using the JOIN command.

PostgreSQL:

--first I need to create the table
CREATE TABLE day_of_week_mapping (
day_of_week_int int,
day_of_week_name text
)

--then I want to add data to my table
INSERT INTO day_of_week_mapping
VALUES (0, 'Sunday'),
(1, 'Monday'),
(2, 'Tuesday'),
(3, 'Wednesday'),
(4, 'Thursday'),
(5, 'Friday'),
(6, 'Saturday')

--then I can join this table to my cleaning table to remap the days of the week
SElECT type, time, usage, units, cost, dowm.day_of_week_name, is_weekend
FROM energy_usage eu
LEFT JOIN day_of_week_mapping dowm ON dowm.day_of_week_int = eu.day_of_week

Results:

type time usage units cost day_of_week_name weekend
Electric usage 2018-07-22 00:45:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:30:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:15:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-07-22 00:00:00.000 0.1 kWh 0.03 Sunday true
Electric usage 2018-02-11 23:00:00.000 0.04 kWh 0.01 Sunday true

Python:

In this case, python has similar mapping functions.

energy_df['day_of_week_name'] = energy_df['day_of_week'].map({0 : 'Sunday', 1 : 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'})
print(energy_df.head(20))

Hopefully, one of these techniques will be useful for you as you approach data renaming!

Additionally, remember that if you would like to change the name of a column in your table, it is truly as easy as AS (I couldn’t not use such a ridiculous statement 😂). When you use the SELECT statement, you can rename your columns like so,

PostgreSQL code:

SELECT type AS usage_type,
time as time_stamp,
usage,
units, 
cost AS dollar_amount
FROM energy_view_exp
LIMIT 20;

Results:

usage_type time_stamp usage units dollar_amount
Electric usage 2016-10-22 00:00:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:15:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:30:00.000 0.01 kWh 0.00
Electric usage 2016-10-22 00:45:00.000 0.01 kWh 0.00

Python code:

Comparatively, renaming columns in Python can be a huge pain. This is an area where SQL is not only faster, but also just more elegant in its code.

energy_df.rename(columns={'type':'usage_type', 'time':'time_stamp', 'cost':'dollar_amount'}, inplace=True)
print(energy_df[['usage_type','time_stamp','usage','units','dollar_amount']].head(20))

Fill in missing data

Another common problem in the data cleaning process is having missing data. For the dataset we are using, there are no obviously missing data points, however, it is very possible that with evaluation, we could find missing hourly data from a power outage or some other phenomenon. This is where the gap-filling functions TimescaleDB offers could come in handy. When using algorithms, missing data can often have significant negative impacts on the accuracy or dependability of the model. Sometimes, you can navigate this problem by filling in missing data with reasonable estimates and TimescaleDB actually has built-in functions to help you do this.

For example, let’s say that you are modeling the energy usage over individual days of the week and a handful of days have missing energy data due to a power outage or an issue with the sensor. We could remove the data, or try to fill in the missing values with reasonable estimations. For today, let’s assume that the model I want to use would benefit more from filling in the missing values.

As an example, I created some data. I called this table energy_data and it is missing both time and energy readings for the timestamps between 7:45am and 11:30am.

time energy
2021-01-01 07:00:00.000 0
2021-01-01 07:15:00.000 0.1
2021-01-01 07:30:00.000 0.1
2021-01-01 07:45:00.000 0.2
2021-01-01 11:30:00.000 0.04
2021-01-01 11:45:00.000 0.04
2021-01-01 12:00:00.000 0.03
2021-01-01 12:15:00.000 0.02
2021-01-01 12:30:00.000 0.03
2021-01-01 12:45:00.000 0.02
2021-01-01 13:00:00.000 0.03

I can use TimescaleDB’s gapfilling hyperfunctions to fill in these missing values. The interpolate() function is another one of TimescaleDB’s hyperfunctions and it creates data points that follow a linear approximation given the data points before and after the missing range of data. Alternatively, you could use the locf() hyperfunction which carries the last recorded value forward to fill in the gap (note that locf stands for last-one-carried-forward). Both of these functions must be used in conjunction with the time_bucket_gapfill() function.

PostgreSQL code:

SELECT
--here I specified that the data should increment by 15 mins
  time_bucket_gapfill('15 min', time) AS timestamp,
  interpolate(avg(energy)),
  locf(avg(energy))
FROM energy_data
--to use gapfill, you will have to take out any time data associated with null values. You can do this using the IS NOT NULL statement
WHERE energy IS NOT NULL AND time > '2021-01-01 07:00:00.000' AND time < '2021-01-01 13:00:00.000'
GROUP BY timestamp
ORDER BY timestamp;

Results:

timestamp interpolate locf
2021-01-01 07:00:00.000 0.1 0.10000000000000000000
2021-01-01 07:30:00.000 0.15 0.15000000000000000000
2021-01-01 08:00:00.000 0.13625 0.15000000000000000000
2021-01-01 08:30:00.000 0.1225 0.15000000000000000000
2021-01-01 09:00:00.000 0.10875 0.15000000000000000000
2021-01-01 09:30:00.000 0.095 0.15000000000000000000
2021-01-01 10:00:00.000 0.08125 0.15000000000000000000
2021-01-01 10:30:00.000 0.0675 0.15000000000000000000
2021-01-01 11:00:00.000 0.05375 0.15000000000000000000
2021-01-01 11:30:00.000 0.04 0.04000000000000000000
2021-01-01 12:00:00.000 0.025 0.02500000000000000000
2021-01-01 12:30:00.000 0.025 0.02500000000000000000

Python code:

energy_test_df['time'] = pd.to_datetime(energy_test_df['time'])
energy_test_df_locf = energy_test_df.set_index('time').resample('15 min').fillna(method='ffill').reset_index()
energy_test_df = energy_test_df.set_index('time').resample('15 min').interpolate().reset_index()
energy_test_df['locf'] = energy_test_df_locf['energy']
print(energy_test_df)

Bonus:

The following query is how I could ignore the missing data. I wanted to include this to show you just how easy it can be to exclude null data. Alternatively, I could use a WHERE clause to specify the times which I could like to ignore (the second query).

SELECT * 
FROM energy_data 
WHERE energy IS NOT NULL

SELECT * 
FROM energy_data
WHERE time <= '2021-01-01 07:45:00.000' OR time >= '2021-01-01 11:30:00.000'

Wrap Up

After reading through these various cleaning techniques, I hope you feel more comfortable with exploring some of the possibilities that PostgreSQL and TimescaleDB provide. By cleaning data directly within my database, I am able to perform a lot of my cleaning tasks a single time rather than repetitively within a script, thus saving me time in the long run. If you are looking to save time and effort while cleaning your data for analysis, definitely consider using PostgreSQL and TimescaleDB.

In my next posts, I will go over techniques on how to transform data using PostgreSQL and TimescaleDB. I'll then take everything we've learned together to benchmark data munging tasks in PostgreSQL and TimescaleDB vs. Python and pandas. The final blog post will walk you through the full process on a real dataset by conducting a deep-dive into data analysis with TimescaleDB (for data munging) and Python (for modeling and visualizations).

If you have questions about TimescaleDB, time-series data, or any of the functionality mentioned above, join our community Slack, where you'll find an active community of time-series enthusiasts and various Timescale team members (including me!).

If you’re ready to see the power of TimescaleDB and PostgreSQL right away, you can sign up for a free 30-day trial or install TimescaleDB and manage it on your current PostgreSQL instances. We also have a bunch of great tutorials to help get you started.

Until next time!

Functionality Glossary:

Ingest and query in milliseconds, even at terabyte scale.
This post was written by
24 min read
Blog
Contributors

Related posts