How do you use the TABLESAMPLE clause to sample rows from a table?
Posted by KarenKg
Last Updated: July 16, 2024
The TABLESAMPLE clause in SQL is used to return a random sample of rows from a table. It is commonly used in large datasets where retrieving all rows may be impractical or where only a representative sample is needed for analysis. The syntax and availability of the TABLESAMPLE clause can vary across different database systems, so it's important to refer to the specifics for the system you are using (e.g., SQL Server, PostgreSQL, etc.). Here's a general usage example for SQL Server and PostgreSQL:
Syntax for SQL Server:
SELECT *
FROM your_table
TABLESAMPLE (N PERCENT);
Example:
SELECT *
FROM employees
TABLESAMPLE (10 PERCENT);
This query would return a random sample of approximately 10% of the rows from the employees table. Note: - The number returned may not be exact since it is based on a random sampling technique. - The sampling can also be done using the number of rows:
Using the REPEATABLE option:
SELECT *
FROM your_table
TABLESAMPLE SYSTEM (10 ROWS) REPEATABLE (12345);
This would return a repeatable sample of 10 rows based on the seed value provided.
Syntax for PostgreSQL:
PostgreSQL uses TABLESAMPLE slightly differently. The sampling method you use can be BERNOULLI or SYSTEM.
Example using BERNOULLI:
SELECT *
FROM your_table
TABLESAMPLE BERNOULLI (10);
This would include approximately 10% of the rows randomly chosen, where each row has a 10% probability of being included.
Example using SYSTEM:
SELECT *
FROM your_table
TABLESAMPLE SYSTEM (5);
This returns a set of rows based on block-level sampling. This method can provide faster results, but the amount of data returned can vary more significantly compared to the BERNOULLI method.
Key Points to Remember:
- The specific sampling percentages (like 10%) are just examples; you can adjust this based on your analysis needs. - Not all databases support the TABLESAMPLE clause, so check the documentation for the database you are using. - The random nature of the sampling means that repeated executions can yield different results. Use a seed where applicable (e.g., in SQL Server) if repeatability is required. - If you're dealing with very large datasets, sampling can be a practical way to handle data quickly without overwhelming resource utilization.
Related Content