How do you use the TABLESAMPLE clause to sample rows from a table?

Posted by KarenKg

Last Updated: July 16, 2024

TABLESAMPLE sampling

The TABLESAMPLE clause in SQL is used to return a random sample of rows from a table. It is commonly used in large datasets where retrieving all rows may be impractical or where only a representative sample is needed for analysis. The syntax and availability of the TABLESAMPLE clause can vary across different database systems, so it's important to refer to the specifics for the system you are using (e.g., SQL Server, PostgreSQL, etc.). Here's a general usage example for SQL Server and PostgreSQL:

Syntax for SQL Server:

SELECT *
FROM your_table
TABLESAMPLE (N PERCENT);

Example:

SELECT *
FROM employees
TABLESAMPLE (10 PERCENT);

This query would return a random sample of approximately 10% of the rows from the employees table. Note: - The number returned may not be exact since it is based on a random sampling technique. - The sampling can also be done using the number of rows:

Using the REPEATABLE option:

SELECT *
FROM your_table
TABLESAMPLE SYSTEM (10 ROWS) REPEATABLE (12345);

This would return a repeatable sample of 10 rows based on the seed value provided.

Syntax for PostgreSQL:

PostgreSQL uses TABLESAMPLE slightly differently. The sampling method you use can be BERNOULLI or SYSTEM.

Example using BERNOULLI:

SELECT *
FROM your_table
TABLESAMPLE BERNOULLI (10);

This would include approximately 10% of the rows randomly chosen, where each row has a 10% probability of being included.

Example using SYSTEM:

SELECT *
FROM your_table
TABLESAMPLE SYSTEM (5);

This returns a set of rows based on block-level sampling. This method can provide faster results, but the amount of data returned can vary more significantly compared to the BERNOULLI method.

Key Points to Remember:

- The specific sampling percentages (like 10%) are just examples; you can adjust this based on your analysis needs. - Not all databases support the TABLESAMPLE clause, so check the documentation for the database you are using. - The random nature of the sampling means that repeated executions can yield different results. Use a seed where applicable (e.g., in SQL Server) if repeatability is required. - If you're dealing with very large datasets, sampling can be a practical way to handle data quickly without overwhelming resource utilization.

Posted by KarenKg

Syntax for SQL Server:

Example:

Using the REPEATABLE option:

Syntax for PostgreSQL:

Example using BERNOULLI:

Example using SYSTEM:

Key Points to Remember:

Related Content

How do you use the TABLESAMPLE clause to retrieve a random sample of rows?

How do you use the TOP or LIMIT clause to restrict the number of returned rows?

How do you use the ROLLUP clause to add summary rows in a GROUP BY query?

How do you use the EXISTS clause to check for the existence of rows in a subquery?

How do you use the GROUP BY clause to group rows and apply aggregate functions?

How do you use the OUTPUT clause to return data from inserted, updated, or deleted rows?

How do you use the OUTPUT clause in an UPDATE statement to return updated rows?

How do you use the OUTPUT clause in an INSERT, UPDATE, or DELETE statement to return affected rows?

How do you implement a FULL OUTER JOIN to combine rows from two tables and include all rows from both sides?

How do you use the FULL OUTER JOIN to combine rows from two tables, including unmatched rows from both sides?

How do you use the HAVING clause in SQL?

How do you use the OFFSET clause without the FETCH NEXT clause for advanced pagination techniques?

How do you use the FOR XML AUTO clause to generate an XML representation of query results?

How do you use the CREATE INDEX statement with the PARTITION clause to create a partitioned index?