New Method to Speed Up Data Sampling from Tables

Thursday - July 27 2023, 17:00 UTC - 2 years ago

tldr #

In a recent breakthrough, a POSTECH research team proposed a novel method for optimal sampling of data stored across various tables. This new technique called degree-based rejection sampling (DRS) managed to generate results rapidly with the help of techniques like generalized hypertree decompositions (GHDs). The acceptance of DRS and GHDs drastically reduces the time needed to join tables and acquire data for advanced AI applications.

content #

The world witnessed a monumental face-off between human intelligence and artificial intelligence in March 2016. The computer program AlphaGo honed its skills from a substantial database and emerged victorious against a human opponent in Go, a game renowned for its complexity in calculating countless possible moves.The importance of quality data for AI's continuous evolution is undeniable. AI has seamlessly integrated into such sectors as healthcare, finance, and education, while its advancements rely heavily on the availability of robust data for learning.

The game Go is considered one of the most complex games to calculate the possible moves from

Data is typically stored in distributed groups known as tables. For an AI to glean insights from these table-stored data, a "join" process is deployed to amalgamate these disparate tables into one comprehensive table. The sheer scale of this resultant table presents challenges in terms of storage, while the join process itself can be quite time-consuming. Even now, developing techniques for swift and uniform data sampling from tables remains a complex puzzle yet to be solved in data science.

This breakthrough was both the first time a paper from a Korean research team was presented at the 42-year-old ACM SIGMOD-SIGACT-SIGAI Symposium

In a significant breakthrough, a POSTECH research team led by Professor Wook-Shin Han (Graduate School of Artificial Intelligence) along with Ph.D. candidate Kyoungmin Kim (Department of Convergence IT Engineering) proposed a novel method for optimal sampling of data stored across various tables. This new technique managed to generate results rapidly.

The research was published as part of the Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems(PODS 2023). This marked a momentous occasion as it was the first instance of a paper from a Korean research team being presented at this symposium in its 42-year history.

This is the first instance of a degree-based rejection sampling (DRS) being presented in the symposium

The researchers pioneered a method named degree-based rejection sampling (DRS), which falls under the umbrella of meta-sampling. Conventional methods necessitated pre-calculating probabilities for every value in the sample space before any value could be extracted directly. By contrast, the DRS method proposed by the team initiates with the extraction of a sample space with a simple probability distribution based on the degree of specific values, subsequently drawing values from this sample space.

The rapid data sampling was achieved by employing a technique known as generalized hypertree decompositions (GHDs) for quicker giveback of results

The team convincingly demonstrated that at least one sample space affords a greater probability than the elaborate probabilities computed via traditional methodologies for any random value that can be selected. This implies that values can be obtained with similar probabilities as traditional methods via rejection sampling. In this way, only the probability of extracting a sample space is merely multiplied as a constant value to the probability of sampling a value, avoiding complex probability calculations, and allowing for rapid data sampling.

The acceptance of DRS and GHDs drastically reduces the time needed to join tables and acquire data

Moreover, the team employed a technique known as generalized hypertree decompositions (GHDs) to extend the method, which involves analyzing a query in a tree format during the join procedure of integrating tables. If an entire query is processed using a singular join algorithm, it can lead to high time complexity, particularly when the query contains multiple join relations.

Using GHDs, the team managed to divide the entire query into subtrees and quickly integrate them with the help of the DRS method, allowing for faster data sampling and efficient computation overall when compared to the pre-existing join method.

Advanced AI applications require quality data, and with this method, AI can continue to evolve

hashtags #

ai datascience drs ghds symposium alphago

worddensity #

data (9, 1.76%)
sampling (7, 1.37%)
join (6, 1.18%)
team (6, 1.18%)
method (6, 1.18%)