There’s data, and then there’s big data. So, what’s the difference?
Big data in general refers to sets of data that are so large in volume and so complex that traditional data processing software products are not capable of capturing, managing, and processing the data within a reasonable amount of time.
These big data sets can include structured, unstructured, and semistructured data, each of which can be mined for insights.
How much data actually constitutes “big” is open to debate, but it can typically be in multiples of petabytes—and for the largest projects in the exabytes range.
Often, big data is characterized by the three Vs: the extreme volume of the data, the wide variety of data types, and the velocity at which the data must be processed.
The data that constitutes big data stores can come from sources that include web sites, social media, desktop and mobile apps, scientific experiments, and—increasingly—sensors and other devices in the internet of things (IoT).
The concept of big data comes with a set of related components that enable organizations to put the data to practical use and solve a number of business problems. These include the IT infrastructure needed to support big data; the analytics applied to the data; technologies needed for big data projects; related skill sets; and the actual use cases that make sense for big data.
What really delivers value from all the big data that organizations gather is the analytics applied to it. Without analytics, it’s just a bunch of data with limited business use.
By applying analytics to big data, companies can see benefits such as increased sales, improved customer service, greater efficiency, and an overall boost in competitiveness.
Data analytics involves examining data sets to gain insights or draw conclusions about what they contain, such as trends and predictions about future activity.
By analyzing data, organizations can make better-informed business decisions such as when and where to run a marketing campaign or introduce a new product or service.
Analytics can refer to basic business intelligence applications or more advanced, predictive analytics such as those used by scientific organizations. Among the most advanced types of data analytics is data mining, where analysts evaluate large data sets to identify relationships, patterns, and trends.
Data analytics can include exploratory data analysis (to identify patterns and relationships in data) and confirmatory data analysis (applying statistical techniques to find out whether an assumption about a particular data set is true).
Another distinction is quantitative data analysis (or analysis of numerical data that has quantifiable variables that can be compared statistically) vs. qualitative data analysis (which focuses on nonnumerical data such as video, images, and text).
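The quantitative-vs.-qualitative distinction can be made concrete with a small sketch using only Python’s standard library. The sales figures and review texts below are invented sample data, not from any real data set:

```python
import statistics
from collections import Counter

# Quantitative analysis: numerical data with variables that can be
# compared statistically (here, daily sales figures).
daily_sales = [1200, 1350, 990, 1480, 1100, 1620, 1275]
mean_sales = statistics.mean(daily_sales)
stdev_sales = statistics.stdev(daily_sales)
print(f"mean={mean_sales:.2f}, stdev={stdev_sales:.2f}")

# Qualitative analysis: nonnumerical data such as free text
# (here, a crude word-frequency count over customer reviews).
reviews = [
    "great product fast shipping",
    "product arrived late",
    "great value great support",
]
word_counts = Counter(word for review in reviews for word in review.split())
print(word_counts.most_common(2))  # [('great', 3), ('product', 2)]
```

Real qualitative analysis of video, images, and text involves far more sophisticated techniques, but the contrast holds: one path summarizes numbers, the other extracts structure from nonnumerical content.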
For the concept of big data to work, organizations need to have the infrastructure in place to gather and house the data, provide access to it, and secure the information while it’s in storage and in transit.
At a high level, this infrastructure includes storage systems and servers designed for big data, data management and integration software, business intelligence and data analytics software, and big data applications.
Much of this infrastructure will likely be on-premises, as companies look to continue leveraging their datacenter investments. But increasingly organizations rely on cloud computing services to handle much of their big data requirements.
Data collection requires having sources to gather the data. Many of these—such as web applications, social media channels, mobile apps, and email archives—are already in place. But as IoT becomes entrenched, companies might need to deploy sensors on all sorts of devices, vehicles, and products to gather data, as well as new applications that generate user data. (IoT-oriented big data analytics has its own specialized techniques and tools.)
To store all the incoming data, organizations need to have adequate data storage in place. Among the storage options are traditional data warehouses, data lakes, and cloud-based storage.
Security infrastructure tools might include data encryption, user authentication and other access controls, monitoring systems, firewalls, enterprise mobility management, and other products to protect systems and data.
In addition to the foregoing IT infrastructure used for data in general, there are several technologies specific to big data that your IT infrastructure should support.
Hadoop is one of the technologies most closely associated with big data. The Apache Hadoop project develops open source software for scalable, distributed computing.
The Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It’s designed to scale up from a single server to thousands, each offering local computation and storage.
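Hadoop itself is a Java framework, but the MapReduce programming model at its core can be sketched in a few lines of Python. This is a local, single-process illustration of the map, shuffle, and reduce phases, not actual Hadoop code:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: each mapper emits a (word, 1) pair for every word in its split.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key across the mappers' output.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values for each key.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data big insights", "big data pipelines"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'insights': 1, 'pipelines': 1}
```

In a real cluster, the documents would be splits of a file stored in HDFS, the mappers and reducers would run on many machines, and the shuffle would move data across the network; the programming model, however, is exactly this simple.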
The project includes several modules: Hadoop Common (the shared utilities that support the other modules), the Hadoop Distributed File System (HDFS), Hadoop YARN (a framework for job scheduling and cluster resource management), and Hadoop MapReduce (a YARN-based system for parallel processing of large data sets).
Part of the Hadoop ecosystem, Apache Spark is an open source cluster-computing framework that serves as an engine for processing big data within Hadoop. Spark has become one of the key big data distributed processing frameworks, and it can be deployed in a variety of ways. It provides native bindings for the Java, Scala, Python, and R programming languages, and it supports SQL, streaming data, machine learning, and graph processing.
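Spark’s signature API style is a chain of transformations (filter, map, reduceByKey, and so on) over a distributed dataset. The sketch below mimics that chain locally with plain Python over some made-up log lines; the real PySpark methods carry the same names but execute across a cluster:

```python
log_lines = [
    "ERROR disk full",
    "INFO job started",
    "ERROR network timeout",
    "WARN slow response",
    "ERROR disk full",
]

# Spark-style pipeline: filter -> map to (key, 1) pairs -> reduce by key.
errors = filter(lambda line: line.startswith("ERROR"), log_lines)
pairs = map(lambda line: (line, 1), errors)

counts = {}
for key, value in pairs:
    # Local stand-in for Spark's reduceByKey(add).
    counts[key] = counts.get(key, 0) + value

print(counts)  # {'ERROR disk full': 2, 'ERROR network timeout': 1}
```

The attraction of Spark is that the same declarative chain runs unchanged whether the data fits in one laptop’s memory or is spread across thousands of nodes.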
Data lakes are storage repositories that hold extremely large volumes of raw data in its native format until the data is needed by business users. Helping to fuel the growth of data lakes are digital transformation initiatives and the growth of the IoT. Data lakes are designed to make it easier for users to access vast amounts of data when the need arises.
Conventional SQL databases are designed for reliable transactions and ad hoc queries, but they come with restrictions such as rigid schema that make them less suitable for some types of applications. NoSQL databases address those limitations, and store and manage data in ways that allow for high operational speed and great flexibility. Many were developed by companies that sought better ways to store content or process data for massive websites. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.
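The horizontal-scaling idea behind many NoSQL databases is partitioning (sharding) data by key. Production systems handle replication, failover, and rebalancing, but the core routing logic can be sketched in a toy key-value store where each "server" is just a Python dict:

```python
import hashlib

# A toy key-value store sharded across several "servers" (plain dicts here)
# to illustrate how NoSQL databases scale horizontally by key.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    # Hash the key so each one maps deterministically to one shard.
    digest = hashlib.sha256(key.encode()).digest()
    return shards[digest[0] % NUM_SHARDS]

def put(key, value):
    shard_for(key)[key] = value

def get(key):
    return shard_for(key).get(key)

for i in range(100):
    put(f"user:{i}", {"id": i})

print(get("user:42"))               # {'id': 42}
print(sum(len(s) for s in shards))  # 100 keys spread across the 4 shards
```

Because any node can locate a key from its hash alone, adding capacity is largely a matter of adding shards, which is far harder to do with a single rigid-schema SQL server.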
An in-memory database (IMDB) is a database management system that primarily relies on main memory, rather than disk, for data storage. In-memory databases are faster than disk-optimized databases, an important consideration for big data analytics uses and the creation of data warehouses and data marts.
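Python’s built-in sqlite3 module offers a convenient, if simplistic, way to see the concept: connecting to ":memory:" creates a database that lives entirely in RAM. The metrics table below is an invented example:

```python
import sqlite3

# ":memory:" keeps the entire database in RAM; reads and writes never
# touch disk, which is the speed advantage of in-memory databases
# (the trade-off being that the data vanishes when the connection closes).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?)",
    [("latency_ms", 12.5), ("latency_ms", 15.0), ("errors", 3.0)],
)

avg = conn.execute(
    "SELECT AVG(value) FROM metrics WHERE name = 'latency_ms'"
).fetchone()[0]
print(avg)  # 13.75
```

Dedicated in-memory systems such as SAP HANA or Redis add persistence, replication, and scale far beyond this sketch, but the underlying idea, keeping the working data set in main memory, is the same.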
Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.
Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.
Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structure and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.
Given how common big data analytics projects have become and the shortage of people with these types of skills, finding experienced professionals might be one of the biggest challenges for organizations.
Big data and analytics can be applied to many business problems and use cases. Here are a few examples:
This story, “What is big data analytics? Everything you need to know” was originally published by