How Can One Use Python and/or R for Summary Statistics and Machine Learning on Data Sets Too Big to Fit into Memory?

When dealing with large data sets, often on the order of 400 million rows or gigabytes' worth of data files, it's easy to run out of memory. For in-house data analysis, machines with enormous memory capacities can be unaffordable for students and researchers. This doesn't mean that an average school or home computer can't carry out large-data calculations.

Data Science is an integral part of our Python Programming Training modules, and everyone should be able to work on large data sets with ease. Below are several ways to make that possible.



Optimizing Calculations for Large Data Sets

  1. Using GRIB Data Format 

GRIB (GRIdded Binary) is a data format that differs from the plain-text (ASCII) formats typically used for CSV files and Excel exports. Because it stores the data in binary form, GRIB takes up far less space in memory.

This means you can fit more data into memory and run your calculations over a larger portion of the data set. GRIB was originally developed for meteorology, where weather-forecast data sets are themselves very large.

In Python, the pygrib and pupygrib libraries make it easy to read and work with GRIB data. In R, you can use the ReadGrib function provided by a GRIB reader package such as rNOMADS.
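As a quick illustration, here is a minimal sketch of reading a GRIB file with pygrib; the file name "forecast.grib" and the field name are hypothetical placeholders, and pygrib must be installed separately (for example with pip install pygrib).

    import pygrib

    # Open the GRIB file and list the messages it contains.
    # "forecast.grib" and the field name below are placeholders.
    grbs = pygrib.open("forecast.grib")
    for grb in grbs:
        print(grb.name, grb.validDate)

    # Rewind and load a single field into a NumPy array only when it is needed.
    grbs.seek(0)
    msg = grbs.select(name="2 metre temperature")[0]
    values = msg.values
    grbs.close()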

  2. Using the Dask Library

Dask is a Python library that extends the popular data-analytics libraries NumPy and Pandas.

NumPy and Pandas are great for summary statistics and machine learning models, but they can't scale up when memory requirements get out of hand. Dask comes in as a helper library that partitions the work and scales it across multiple cores or a cluster, allowing these libraries to process data in parallel and out of core.

Dask is available in Python for free, and you can put it to use for significant gains. User reports show that it can enable a machine with 4 GB of RAM and multiple CPU cores to carry out data analytics with ease; a minimal sketch follows.
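To make this concrete, here is a minimal sketch of out-of-core summary statistics with Dask's DataFrame API; the file name and column names are hypothetical placeholders.

    import dask.dataframe as dd

    # Lazily point at a CSV that is too large for RAM, split into ~64 MB partitions.
    # "transactions.csv" and the column names are placeholders.
    df = dd.read_csv("transactions.csv", blocksize="64MB")

    # Summary statistics are computed partition by partition, in parallel across cores.
    print(df["amount"].describe().compute())

    # Group-by aggregations work much like Pandas, but out of core.
    totals = df.groupby("customer_id")["amount"].sum().compute()
    print(totals.head())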

  3. Progressive Loading Techniques

Progressive loading refers to splitting a large data set into smaller, workable chunks that don't overburden the host machine. There are several techniques for carrying this out.

One example that is often taught when you Learn Python Programming with us is Stochastic Gradient Descent (SGD). In this technique, the model is updated one random sample (or small batch) at a time, so only a fraction of the data needs to sit in memory during each iteration.

This way you can achieve good performance with far less strain on your machine, as shown in the sketch below. To fully understand the functions and algorithms used in SGD, you can sign up for the Python Programming Training at Imarticus.
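Here is a minimal sketch of the idea, using Pandas' chunked CSV reader together with scikit-learn's SGDRegressor, which supports incremental fitting; the file name and column names are hypothetical placeholders.

    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    # "big_data.csv", the feature columns and the target column are placeholders.
    model = SGDRegressor()

    # Stream the file in 100,000-row chunks so only one chunk is in memory at a time.
    for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
        X = chunk[["feature_1", "feature_2"]]
        y = chunk["target"]
        model.partial_fit(X, y)  # incremental (stochastic) gradient descent update

    print(model.coef_, model.intercept_)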

  4. Use a Large Virtual Machine

While optimization techniques and libraries go a long way toward reducing memory consumption, sometimes the answer simply lies in using a bigger computer for your calculations.

For that, you can rent a cloud virtual machine that comes with the amount of memory you're aiming at. Just to give you an idea, you can use a machine on AWS with 64 GB of RAM and 8 vCPUs for about $12 a day, which is quite cost-effective.
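As a rough sanity check on that figure, assuming an on-demand hourly rate of about $0.50 for a 64 GB, 8 vCPU instance (actual AWS prices vary by region and instance type):

    # Hypothetical on-demand hourly rate for a 64 GB / 8 vCPU instance.
    hourly_rate = 0.50
    print(f"~${hourly_rate * 24:.2f} per day")  # ~$12.00 per day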

Conclusion        

Large data sets are at the heart of Data Science work, and an analyst must learn how to manage them efficiently. Our Python with Data Science and R courses cover many such fundamentals for aspiring data professionals, and the techniques described in this article are part of the curriculum.

  
