Each column name represents one discrete data element defined by a data type. The id column is of data type serial, a special integer type that auto-increments every time you add a row to the table.
After the INSERT INTO keywords comes the name of the table, and in parentheses the columns to be filled. Next is the VALUES keyword and the data to insert into each column of each row. You need to enclose the data for each row in a set of parentheses, and inside each set of parentheses, use a comma to separate each column value. The order of the values must also match the order of the columns specified after the table name. Each row of data ends with a comma, and the last row ends the entire statement with a semicolon. Text and dates require quotes, whereas numbers, including integers and decimals, don’t require quotes. The date format used, YYYY-MM-DD, is the international standard for dates, and using it will help you avoid confusion.
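As a sketch, an INSERT following these rules might look like this (the column names and values are hypothetical, chosen to match the employees table used in later examples):

```sql
INSERT INTO employees (first_name, last_name, department, salary, hire_date)
VALUES ('Ada', 'Lovelace', 'Engineering', 72500.00, '2020-03-15'),
       ('Edgar', 'Codd', 'Research', 81000.00, '2019-11-02');
```

Note that the text and date values are quoted, the numbers are not, and each row sits in its own set of parentheses.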
SELECT * fetches every row and column in a table. The * is a wildcard: a stand-in that represents all of the table’s columns rather than naming each one.
Querying a Subset of Columns
```sql
%%sql
SELECT last_name, first_name, salary
FROM employees;
```
You can fetch a limited number of columns by naming columns, separated by commas, right after the SELECT keyword. You’re able to retrieve columns in any order you like.
Finding Unique Values
To understand the range of values in a column, we can use the DISTINCT keyword as part of a query that eliminates duplicates and shows only unique values.
This technique gives us the ability to ask, “For each x in the table, what are all the y values?” For example: for each department, what are all the salaries?
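A minimal sketch of both uses of DISTINCT, assuming the employees table from the earlier examples:

```sql
-- The unique values in one column
SELECT DISTINCT department
FROM employees;

-- For each department, what are all the salaries?
SELECT DISTINCT department, salary
FROM employees
ORDER BY department, salary;
```

With two columns listed, DISTINCT returns each unique pair of values, which is what makes the “for each x, what are all the y values?” question answerable.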
Sorting Data
We order the results of a query using a clause containing the keywords ORDER BY followed by the name of the column or columns to sort. Applying this clause doesn’t change the original table, only the result of the query.
ORDER BY sorts values in the ascending order (ASC) by default, but you can use the DESC keyword to sort in the descending order.
We’re not limited to sorting on just one column.
```sql
%%sql
SELECT first_name, last_name, department, hire_date
FROM employees
ORDER BY department ASC, hire_date DESC;
```
The statement sorts employees by department in ascending order and, within each department, by hire date with the newest hires first.
Comparison and Matching Operators
=             Equal to
<> or !=      Not equal to
>             Greater than
<             Less than
>=            Greater than or equal to
<=            Less than or equal to
BETWEEN       Within a range
IN            Match one of a set of values
LIKE          Match a pattern (case sensitive)
ILIKE         Match a pattern (case insensitive)
NOT           Negates a condition
Using LIKE and ILIKE with WHERE
Both operators let you search for patterns in strings by using two special characters:
- Percent sign (%): a wildcard matching zero or more characters
- Underscore (_): a wildcard matching exactly one character
For example, if you’re trying to find the word sales, the following LIKE patterns will match it:

LIKE 's%'
LIKE '%al%'
LIKE '_ales'
LIKE 'sa_es'
The only difference between the LIKE and ILIKE operators is that LIKE is case-sensitive and ILIKE is not.
Because we connect the two conditions using AND, both must be true for a row to meet the criteria in the WHERE clause and be returned in the query results. When we connect conditions using OR, at least one of the conditions must be true for a row to meet the criteria of the WHERE clause. When we place statements inside parentheses, those are evaluated as a group before being combined with other criteria.
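A sketch combining AND, OR, and parentheses in one WHERE clause (the department name and salary threshold are arbitrary):

```sql
SELECT first_name, last_name, department, salary
FROM employees
WHERE (department = 'Sales' AND salary >= 50000)
   OR last_name ILIKE 'sm%';
```

The parentheses force the AND pair to be evaluated as a group first; without them, AND binds more tightly than OR, which can silently change which rows match.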
Character Data Types
char(n): A fixed-length column where the character length is specified by n. This is not used very often nowadays.
varchar(n): A variable length column where the maximum length is specified by n. If you insert fewer characters than the maximum, PostgreSQL will not store the extra spaces. The longer name for this data type is character varying(n).
text: A variable-length column of unlimited length.
```sql
COPY char_data_types
TO 'C:\YourDirectory\typetest.txt'
WITH (FORMAT CSV, HEADER, DELIMITER '|');
```
We don’t always need to specify the column names in an INSERT INTO statement. If the values match the number of columns in the table, the database assumes you’re inserting them in the order the columns were defined in the table.
Here, the COPY keyword exports the data to a text file named typetest.txt in a directory you specify. The directory must already exist; PostgreSQL won’t create it for you. COPY table_name FROM is the import function and COPY table_name TO is the export function.
Number Data Types
Integers
There are three integer types
- smallint: Storage size of 2 bytes with a range from −32768 to +32767.
- integer: Storage size of 4 bytes with a range from −2147483648 to +2147483647.
- bigint: Storage size of 8 bytes with a range from −9223372036854775808 to +9223372036854775807.
When the data values will remain within a small, constrained range, e.g., days of the month or year, smallint makes sense.
Auto-incrementing Integers
When you add a column with serial type, PostgreSQL will auto-increment the value in the column each time you insert a row, starting with 1, up to the maximum of each integer type.
- smallserial: Storage size of 2 bytes with a range from 1 to 32767.
- serial: Storage size of 4 bytes with a range from 1 to 2147483647.
- bigserial: Storage size of 8 bytes with a range from 1 to 9223372036854775807.
Note that with serial data types, if a row is deleted, the value it used is never reused. If a row insert is aborted, the sequence for the column is still incremented.
Decimal Numbers
Fixed-Point Numbers: Also called the arbitrary precision type, declared as follows:
numeric(precision,scale): You give the argument precision as the maximum number of digits to the left and right of the decimal point, and the argument scale as the number of digits allowable on the right of the decimal point. Alternatively, you can specify this type using decimal(precision,scale). If you omit specifying a scale value, the scale will be set to zero, which creates an integer. If you omit specifying the precision and the scale, the database will store values of any precision and scale up to the maximum allowed.
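A small sketch of how precision and scale constrain values (the table and column names are hypothetical):

```sql
CREATE TABLE price_test (
    price numeric(8,2)  -- up to 8 digits total, 2 to the right of the decimal point
);

INSERT INTO price_test (price) VALUES (123456.78);    -- fits: 6 digits left, 2 right
-- INSERT INTO price_test (price) VALUES (1234567.89); -- fails: 7 digits left exceeds precision 8
```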
Floating-Point Types: Also called variable-precision types, these come in two flavors:
- real: Storage size of 4 bytes, with precision to 6 decimal digits.
- double precision: Storage size of 8 bytes, with precision to 15 decimal digits.
Notes:
- Unless your data uses decimals, stick with integer types.
- Floating-point types are referred to as “inexact” and can lead to unintended mathematical errors. The storage required by the numeric data type is variable and can consume considerably more space than the floating-point types.
- If you’re working with decimal data and need calculations to be exact (e.g., money, space travel), choose numeric/decimal.
- When choosing a number type, err on the side of bigger.
Date and Time Data Types
timestamp: Records date and time. You can add the keywords with time zone to ensure that the time recorded for an event includes the time zone where it occurred. You can either use timestamp with time zone or just timestamptz.
date: Records just the date.
time: Records just the time.
interval: Holds a value representing a unit of time expressed in the format quantity unit. It doesn’t record the start or end of a time period, only its length. You’ll typically use this type for calculations or filtering on other date and time columns.
The International Organization for Standardization (ISO) format for dates and times is YYYY-MM-DD HH:MM:SS.
For the first row, we use the time zone abbreviation.
For the second row, we use a value that represents the number of hours offset from Coordinated Universal Time (UTC).
For the third row, we use the name of an area and location using values found in a standard time zone database.
For the fourth row, we use the now() function, which captures the current transaction time from your hardware.
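The four rows described above might be inserted like this (the table and column names are assumptions for illustration):

```sql
CREATE TABLE ts_test (
    ts timestamptz
);

INSERT INTO ts_test (ts) VALUES
    ('2022-12-31 01:00 EST'),                  -- time zone abbreviation
    ('2022-12-31 01:00 -8'),                   -- hours offset from UTC
    ('2022-12-31 01:00 Australia/Melbourne'),  -- area/location from the time zone database
    (now());                                   -- current transaction time
```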
Transforming Values from One Data Type to Another
The CAST() function lets you transform a value from its stored data type to another type, although it succeeds only when the target data type can accommodate the original value. For example, casting an integer as text always works, but casting text to an integer fails unless the text represents a valid number.
A shortcut notation for this function is ::. Insert the double colon in between the name of the column and the data type you want to convert it to.
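Both notations side by side, assuming the employees table from earlier (the varchar length is arbitrary):

```sql
-- Full function syntax
SELECT CAST(hire_date AS varchar(10)) FROM employees;

-- Equivalent double-colon shortcut
SELECT hire_date::varchar(10) FROM employees;

-- Text-to-number casts work only for valid numbers
SELECT '100'::integer;    -- succeeds
-- SELECT 'abc'::integer; -- raises an error
```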
A delimited text file contains rows of data, and each row represents one row in a table. In each row, a character separates, or delimits, each data column.
Columns that contain Delimiters
Sometimes the delimiter is also part of the data; for example, an address value in a CSV file can itself contain a comma. In such cases, delimited files wrap columns that contain a delimiter character in an arbitrary character called a text qualifier, which tells SQL to ignore the delimiter characters held within. In CSV files, the text qualifier is most often the double quote:
> John,Doe,"123 Main St., Apartment 200",Hyde Park,NY,845-555-1212
The database will recognize that the double quotes signify one column regardless of whether it finds a delimiter within the quotes.
Handling Header Rows
A header row is a single row at the top of the file that lists the name of each data field. Because PostgreSQL does not use the header row, we don’t want that row imported to a table, so we’ll use a HEADER option in the COPY command to exclude it.
Using COPY to Import Data
COPY table_name FROM 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER);
The block of code starts with the COPY keyword followed by the name of the target table, which must already exist in your database. The FROM keyword identifies the full path to the source file, including its name. The WITH keyword lets you specify options, surrounded by parentheses, that you can tailor to your input or output file. Here we specify that the external file should be comma-delimited, and that we should exclude the file’s header row in the import.
Use the FORMAT format_name option to specify the type of file you’re reading or writing. Format names are CSV, TEXT, or BINARY. On import, use HEADER to specify that the source file has a header row. On export, using HEADER tells the database to include the column names as a header row in the output file. The DELIMITER 'character' option lets you specify which character your import or export file uses as a delimiter. If you use FORMAT CSV, the assumed delimiter is a comma.
Using COPY to Export Data
The main difference between exporting and importing data with COPY is that rather than using FROM to identify the source data, you use TO for the path and name of the output file.
Exporting All Data:
> COPY table_name TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');
Exporting Particular Columns:
> COPY table_name (column_1, column_2, column_3) TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');
Exporting Query Results: > COPY (SELECT column_1, column_2 FROM table_name WHERE column_1 ILIKE '%post%') TO 'C:\YourDirectory\your_file.csv' WITH (FORMAT CSV, HEADER, DELIMITER '|');
Math with SQL
Basic Math Operators
+               Addition
-               Subtraction
*               Multiplication
/               Division (returns only the quotient, no remainder)
%               Modulo (returns just the remainder)
^               Exponentiation
|/ or sqrt(n)   Square root
||/             Cube root
factorial(n)    Factorial (how many ways can a number of items be ordered?)
In calculations with an operator between two numbers (addition, subtraction, multiplication, and division), the data type returned follows this pattern:
- Two integers return an integer.
- A numeric on either side of the operator returns a numeric.
- Anything with a floating-point number returns a floating-point number of type double precision.
However, the exponentiation, root, and factorial functions are different. Each takes one number either before or after the operator and returns numeric and floating-point types, even when the input is an integer. Sometimes the result’s data type will suit your needs; other times, you may need to use CAST to change it.
The above query gives us the difference between each employee’s present and starting salary. The AS keyword is used to give a column a more readable alias.
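The salary-difference query described above might look like this (the present_salary and starting_salary column names are assumptions based on the surrounding examples):

```sql
SELECT first_name,
       last_name,
       present_salary - starting_salary AS salary_growth
FROM employees;
```

Without the AS alias, the computed column would carry an auto-generated name such as ?column?, which is hard to read in results.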
Statistics with SQL
Finding Percentages of the Whole
To figure out the percentage of the whole, divide the number in question by the total, and then multiply by 100.
Tracking Percent Change
The formula to calculate percent change can be expressed like this:
((new number - old number) / old number) * 100
The round() function here removes all but one decimal place. The function takes two arguments: the column or expression to be rounded, and the number of decimal places to display.
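Combining the percent-change formula with round(), a query might look like this (the present_salary and starting_salary columns are assumptions):

```sql
SELECT first_name,
       last_name,
       round(((present_salary - starting_salary) / starting_salary::numeric) * 100, 1)
           AS pct_change
FROM employees;
```

Casting to numeric before dividing guards against integer division, which would truncate the result to a whole number.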
Averages and Sums
We can use the avg() and sum() functions to find the averages and sums respectively.
The median is the middle value in an ordered set of values. Averages are less helpful when the values are bunched, or skewed, toward one end of the distribution, or if the group includes outliers. On such occasions, the median is a better indicator of the data set. A good test is to calculate the average and the median for a group of values. If they’re close, the group is probably normally distributed with a bell curve, and the average is useful. If they’re apart, the values are not normally distributed and the median is the better representation.
Percentiles
Percentiles indicate the point in an ordered set of data below which a certain percentage of the data is found. The median is equivalent to the 50th percentile. We can use percentile functions to find the median as well as other quantiles. The percentile_cont(n) function calculates percentiles as continuous values, which means the result doesn’t have to be one of the numbers in the data set but can be a decimal value in between two of them. The percentile_disc(n) function returns only discrete values: the result is always a value actually present in the data set.
```sql
%%sql
CREATE TABLE percentile_test (
    numbers integer
);

INSERT INTO percentile_test (numbers) VALUES
    (1), (2), (3), (4), (5), (6);

SELECT
    percentile_cont(.5) WITHIN GROUP (ORDER BY numbers),
    percentile_disc(.5) WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;
```
The JOIN statement links one table to another in the database during a query, using matching values in columns we specify in both tables.
> SELECT * FROM table_a JOIN table_b ON table_a.key_column = table_b.foreign_key_column
This is similar to earlier SELECT statements, but instead of naming one table in the FROM clause, we name a table, give the JOIN keyword, and then name a second table. The ON keyword follows, where we specify the columns we want to use to match values.
You could also use the ON clause for an expression that evaluates to true or false. For example, you could match where values from one column are greater than or equal to values in another column:
> ON table_a.key_column >= table_b.foreign_key_column
In the departments table, the dept_id column is the table’s primary key. A primary key is a column or collection of columns whose values uniquely identify each row in a table. A valid primary key column enforces certain constraints:
- The column or collection of columns must have a unique value for each row. The values only need to be unique within a table.
- The column or collection of columns can’t have missing values.
You define the primary key for departments and employees using the CONSTRAINT keyword.
In the employees table, the emp_id column uniquely identifies each row, thereby making it the primary key. The dept_id column in the employees table refers to values in the departments table’s primary key. This is a foreign key, which you add as a constraint when creating a table. A foreign key constraint requires a value entered in a column to already exist in the primary key of the table it references. So, values in dept_id in the employees table must exist in dept_id in the departments table; otherwise you can’t add them. Unlike a primary key, a foreign key can be empty, and it can contain duplicate values.
Both tables also include a UNIQUE constraint, which guarantees that values in a column, or a combination of values in more than one column, are unique. In departments, it requires that each row have a unique pair of values for dept and city. In employees, each row must have a unique pair of emp_id and dept_id. You add these constraints to avoid duplicate data.
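A sketch of how these two tables might be declared with the primary key, foreign key, and UNIQUE constraints described above (the exact column lists are assumptions):

```sql
CREATE TABLE departments (
    dept_id integer,
    dept varchar(100),
    city varchar(100),
    CONSTRAINT dept_key PRIMARY KEY (dept_id),
    CONSTRAINT dept_city_unique UNIQUE (dept, city)
);

CREATE TABLE employees (
    emp_id integer,
    first_name varchar(100),
    last_name varchar(100),
    salary numeric(10,2),
    dept_id integer REFERENCES departments (dept_id),  -- foreign key
    CONSTRAINT emp_key PRIMARY KEY (emp_id),
    CONSTRAINT emp_dept_unique UNIQUE (emp_id, dept_id)
);
```

Because of the REFERENCES clause, an INSERT into employees with a dept_id not present in departments will fail.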
Querying multiple tables using JOIN
When you join tables in a query, the database connects rows in both tables where the columns you specified for the join have matching values. The query results then include columns from both tables if you requested them as part of the query. You also can use columns from the joined tables to filter results using a WHERE clause.
Let’s look at the simple JOIN ... ON syntax first.
In the above example, you include an asterisk to choose all columns from both tables. Next, the JOIN keyword goes between the two tables you want data from. Finally, you specify the columns to join the tables using the ON keyword. When you run the query, the results include all values from both tables where values in the dept_id column match.
JOIN Types
```sql
%%sql
CREATE TABLE schools_left (
    id integer CONSTRAINT left_id_key PRIMARY KEY,
    left_school varchar(30)
);

CREATE TABLE schools_right (
    id integer CONSTRAINT right_id_key PRIMARY KEY,
    right_school varchar(30)
);

INSERT INTO schools_left (id, left_school) VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (5, 'Washington Middle School'),
    (6, 'Jefferson High School');

INSERT INTO schools_right (id, right_school) VALUES
    (1, 'Oak Street School'),
    (2, 'Roosevelt High School'),
    (3, 'Morrison Elementary'),
    (4, 'Chase Magnet Academy'),
    (6, 'Jefferson High School');
```
LEFT JOIN: Returns every row from the left table plus rows that match values in the joined column from the right table.
The LEFT JOIN and RIGHT JOIN keywords each return all rows from one table and display blank rows from the other table if no matching values are found in the joined columns.
FULL OUTER JOIN: Returns every row from both tables, joining the rows where values in the joined columns match. If there’s no match for a value in either the left or right table, the query result contains an empty row for the other table.
CROSS JOIN: The query returns each row in the left table with each row in the right table to present all possible combinations of rows from both tables.
Definitely avoid this join on large tables.
To select particular columns, you use the SELECT keyword followed by the desired column names. When joining tables, you must include the column as well as its table name.
To create a table alias, we place a character or two after the table name when we declare it in the FROM clause. Those characters then serve as an alias we can use instead of the full table name.
```sql
%%sql
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt
LEFT JOIN schools_right AS rt
    ON lt.id = rt.id;
```
The same pattern extends to more than two tables: for example, a query could join schools_left to a schools_enrollment table on the id fields, then join the result to a school_grades table, again on the id fields.
Find Rows with Missing Values with NULL
In SQL, NULL is a special value that represents a condition in which there’s no data present or where the data is unknown because it wasn’t included. You can use NULL across data types.
You can find empty rows by adding a WHERE clause to filter for NULL by using the phrase IS NULL. If you want to look for columns with data, use IS NOT NULL.
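For example, using the schools tables defined earlier, a LEFT JOIN plus IS NULL finds the rows in the left table that have no match in the right (a minimal sketch):

```sql
-- Rows in schools_left with no counterpart in schools_right
SELECT lt.id, lt.left_school, rt.right_school
FROM schools_left AS lt
LEFT JOIN schools_right AS rt
    ON lt.id = rt.id
WHERE rt.id IS NULL;
```

This works because the LEFT JOIN fills the right table’s columns with NULL wherever no matching id exists.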
Note that in the column constraint syntax, you can omit the CONSTRAINT keyword and the name for the key, and simply use PRIMARY KEY.
Now let’s look at an example of a table constraint, where we declare the CONSTRAINT after listing the final column, with the column we want to use as the key in parentheses.
Note that you must use the table constraint syntax when you want to create a primary key using more than one column. In that case, you would list the columns in parentheses, separated by commas.
Composite Primary Key
A composite primary key is when you create a suitable key from a combination of columns.
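A minimal sketch using a hypothetical district_days table, where neither column alone is unique but the pair is:

```sql
CREATE TABLE district_days (
    district_id integer,
    school_day date,
    -- Table constraint syntax is required for a multi-column key
    CONSTRAINT days_key PRIMARY KEY (district_id, school_day)
);
```

A district can appear many times and a date can appear many times, but the same district on the same date can appear only once.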
We’ve already seen the three serial types: smallserial, serial, and bigserial. An easy way to create a surrogate primary key is with an auto-incrementing integer using one of the serial data types.
PostgreSQL also allows you to store and compare UUID values, but it does not include functions for generating UUID values in its core. Instead, it relies on third-party modules that provide specific algorithms to generate UUIDs.
To maintain referential integrity, the foreign key constraint prevents us from deleting a row from the licenses table above before removing any related rows in the registrations table, because doing so would leave an orphaned record. To delete a row in licenses and have that action automatically delete any related rows in registrations, we can specify that behavior by adding ON DELETE CASCADE when defining the foreign key constraint.
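A sketch of the two tables described above, with ON DELETE CASCADE on the foreign key (the exact column lists are assumptions):

```sql
CREATE TABLE licenses (
    license_id varchar(10),
    CONSTRAINT licenses_key PRIMARY KEY (license_id)
);

CREATE TABLE registrations (
    registration_id varchar(10),
    license_id varchar(10) REFERENCES licenses (license_id) ON DELETE CASCADE,
    CONSTRAINT registration_key PRIMARY KEY (registration_id)
);
```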
With the above query, deleting a row in licenses should also delete all related rows in registrations.
Constraints
The CHECK Constraint
A CHECK constraint evaluates whether data added to a column meets the expected criteria, which we specify with a logical test. We can implement it as a column or a table constraint.
For a column constraint, declare it in the CREATE TABLE statement after the column name and data type: CHECK (logical expression).
As a table constraint, use the syntax CONSTRAINT constraint_name CHECK (logical expression) after all columns are defined.
You can ensure that a column has a unique value in each row by using the UNIQUE constraint. The difference between this and a primary key is that a unique constraint permits multiple NULL values in a column.
You can remove a constraint or add one later to an existing table using ALTER TABLE, a command which makes changes to tables and columns. You can only add a constraint to an existing table if the data in the target column obeys the limits of the constraint.
To remove a primary key, foreign key, or a UNIQUE constraint: > ALTER TABLE table_name DROP CONSTRAINT constraint_name;
To remove a NOT NULL constraint: > ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
The count() aggregate function is used to check the number of rows and perform other counting tasks. If we use an asterisk as an input (count(*)), the function returns the number of table rows regardless of whether they include null values.
When you use the GROUP BY clause with aggregate functions, you can group results according to the values in one or more columns. On its own, GROUP BY eliminates duplicate values from the results, similar to DISTINCT.
You can combine GROUP BY with an aggregate function like count() to pull more descriptive information from the data. For example, you can get the number of employees in each department: we ask for the values in the department column and a count of those values.
Note that when we select individual columns along with an aggregate function, we must include the columns in a GROUP BY clause. You can’t group values by aggregating and have ungrouped column values in the same query.
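For example, counting employees per department, with the grouped column listed in both SELECT and GROUP BY:

```sql
SELECT department, count(*) AS employee_count
FROM employees
GROUP BY department
ORDER BY employee_count DESC;
```

Omitting department from the GROUP BY clause here would produce an error, since it would be an ungrouped column alongside an aggregate.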
Using GROUP BY on Multiple Columns with count()
We can count the number of employees grouped by both department and starting salary.
To filter results of aggregate functions, we need to use the HAVING clause. Aggregate functions, such as sum(), can’t be used within a WHERE clause because WHERE evaluates conditions row by row, while aggregate functions work across rows. The HAVING clause places conditions on the groups created by aggregating.
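A sketch of HAVING in action, keeping only departments whose total salary exceeds a threshold (the threshold value is arbitrary):

```sql
SELECT department, sum(salary) AS total_salary
FROM employees
GROUP BY department
HAVING sum(salary) > 500000;
```

A WHERE clause could still appear before GROUP BY to filter individual rows; HAVING then filters the aggregated groups.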
Concatenation combines two or more string or non-string values into one. We use UPDATE in conjunction with the double-pipe string operator ||, which performs concatenation.
```sql
%%sql
UPDATE employees
SET department = department || ' dept';
```
We can use ALTER TABLE to modify the structure of tables.
> The code for adding a column: ALTER TABLE table_name ADD COLUMN column_name data_type;
The code to remove a column: ALTER TABLE table_name DROP COLUMN column_name;
The code to change the data type of a column: ALTER TABLE table_name ALTER COLUMN column_name SET DATA TYPE data_type;
The code to add a NOT NULL constraint to a column: ALTER TABLE table_name ALTER COLUMN column_name SET NOT NULL;
The code to remove a NOT NULL constraint: ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
Modifying Values with UPDATE
The UPDATE statement modifies the data in a column in all rows or in a subset of rows that meet a condition. First we pass UPDATE the name of the table to update, and then pass the SET clause the column that contains the values to change. The new value to place in the column can be a string, number, the name of another column, or even a query or expression that generates a value.
UPDATE table_name SET column_name = value;
We can update values in multiple columns at a time by adding additional columns and source values, and separating each column and value statement with a comma. > UPDATE table_name SET column_a = value_1, column_b = value_2;
We can restrict the update to particular rows by adding a WHERE clause with some criteria that must be met before the update can happen. > UPDATE table_name SET column_a = value WHERE criteria;
We can also update one table with values from another table using a subquery, a query inside a query. > UPDATE table_a SET column = (SELECT column FROM table_b WHERE table_a.column = table_b.column) WHERE EXISTS (SELECT column FROM table_b WHERE table_a.column = table_b.column);
The value portion of the SET clause is a subquery, which is a SELECT statement inside parentheses that generates the value for the update. Similarly, the WHERE EXISTS clause uses a SELECT statement to generate values that serve as the filter for the update. If we didn’t use this clause, we might inadvertently set some values to NULL without planning to.
PostgreSQL also offers a simpler syntax for updating values across tables, using a FROM clause: > UPDATE table_a SET column = table_b.column FROM table_b WHERE table_a.column = table_b.column;
Deleting Data
SQL includes options to remove rows and columns from a table along with options to delete an entire table or database. It’s easy to exclude unwanted data in queries using a WHERE clause, so decide whether you truly need to delete the data or can just filter it out. Cases where deleting may be the best solution include data with errors or data imported incorrectly.
Deleting Rows from a Table
Using a DELETE FROM statement, we can remove all rows from a table, or we can use a WHERE clause to delete only the portion that matches an expression we supply.
Delete all rows from a table: DELETE FROM table_name;
Delete selected rows: DELETE FROM table_name WHERE expression;
Deleting a Column from a Table
We can remove a column, including all data within the column, from a table by using the DROP keyword in the ALTER TABLE statement.
Remove a column: ALTER TABLE table_name DROP COLUMN column_name
Deleting a Table from a Database
Delete a table: DROP TABLE table_name;
Save or Revert Changes using Transaction Blocks
The only way to undo any changes after you run a DELETE or UPDATE query is to restore from a backup. However, you can check your changes before finalizing them and cancel the change if it’s not what you intended. You do this by wrapping the SQL statement within a transaction block, which is a group of statements you define using the following keywords at the beginning and end of the query:
START TRANSACTION signals the start of the transaction block. In PostgreSQL, you can also use the non-ANSI SQL BEGIN keyword.
COMMIT signals the end of the block and saves all changes.
ROLLBACK signals the end of the block and reverts all changes.
Usually, database programmers employ a transaction block in complex database systems to define the start and end of a sequence of operations that perform one unit of work in a database.
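A sketch of checking an UPDATE inside a transaction block before deciding whether to keep it:

```sql
START TRANSACTION;

UPDATE employees
SET department = 'Sales';

-- Inspect the result before finalizing
SELECT first_name, last_name, department FROM employees;

-- Not what we intended: revert every change in the block
ROLLBACK;

-- Had the result looked right, COMMIT; would have saved it instead
```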
Improving Performance when Updating Large Tables
Adding a column to a table and filling it with values can quickly inflate the table’s size because the database creates a new version of the existing row each time a value is updated, and it doesn’t delete the old row version.
Instead of adding a column and filling it with values, we can save disk space by copying the entire table and adding a populated column during the operation. Then, we rename the tables so the copy replaces the original, and the original becomes a backup.
```sql
CREATE TABLE employees_backup AS
    SELECT *, '2021-01-12'::date AS end_date
    FROM employees;

ALTER TABLE employees RENAME TO employees_temp;
ALTER TABLE employees_backup RENAME TO employees;
ALTER TABLE employees_temp RENAME TO employees_backup;
```
Finding Ranks
We’ll focus on two rank functions, rank() and dense_rank(), which are both window functions that perform calculations across a set of rows we specify using the OVER clause.
The difference between rank() and dense_rank() is the way they handle the next rank value after a tie: rank() includes a gap in the rank order, but dense_rank() does not.
```sql
%%sql
SELECT first_name, present_salary,
       rank() OVER (ORDER BY present_salary DESC),
       dense_rank() OVER (ORDER BY present_salary DESC)
FROM employees;
```
Analysts often calculate a rate per 1,000 people, or some multiple of that number, for apples-to-apples comparisons, for example, when comparing crime rates across cities of different sizes.
Say x is the number of offenses and p is the total population, then the rate of offenses per 1,000 people is
(x / p) * 1000
```sql
SELECT city, st, population, crime,
       round((crime::numeric / population) * 1000, 1) AS pc_per_1000
FROM crime_data
WHERE population >= 500000
ORDER BY (crime::numeric / population) DESC;
```