Terms and Phrases | Definition |
---|---|
Compute ("Compute Template," "Cluster Template," "Cluster") | As used by Infoworks, compute refers to a configuration within an environment template referring to a compute provider available from the data engine configured for this environment. An individual compute configuration may serve as a template for an ephemeral cluster to be provisioned on demand. Or, it may be a reference to a persistent cluster, which is either provisioned and continuously available, or is controlled via a cluster action node within a related workflow. |
Data Engine | As used by Infoworks, a data engine is a provider of data management services, which runs over distributed compute. Data engines may reflect a data lake architecture, a data warehouse architecture, or expose features blending both approaches. Just as multiple engines power an airliner, data engines support the data ingestion, transformation, and migration services automated by Infoworks. |
Data Model | As used by Infoworks, a data model is a data table generated as an output of a transformation pipeline, in storage available to the environment in which the transformation pipeline is run. |
Data Source | As used by Infoworks, data source refers to the metadata and data ingested through a connection established from a source configuration. Data ingested as a data source may serve as a source data table within a transformation pipeline. |
Data Table | As used by Infoworks, a data table is either data ingested as a data source, or data subsequently modified by a transformation pipeline into a new data model. Either form of data table may be used as source data for a transformation pipeline. A transformation pipeline must have one, and may have many, data tables. |
Domain | As used by Infoworks, a domain is a named set of logically related users and data sources, to support collaboration on transformation pipelines and workflow orchestrations built within the domain. Pipeline extensions are also assigned by domain. |
Environment | As used by Infoworks, an environment is one specific data engine running within a specific cloud service provider, configured to work with a specific set of storages and computes. An environment is defined by an environment template. |
Environment Template ("Environment Configuration") | As used by Infoworks, an environment template is a set of configurations comprising settings related to (1) a cloud service provider, (2) a data engine in this cloud, including its related metastore, (3) one or more storages available to this engine, in this cloud, and (4) one or more computes available to this engine, in this cloud. |
External Data Source | As used by Infoworks, an external data source is one of over two hundred distinct data management systems and formats for which a data source can be configured by Infoworks to ingest, transform, and migrate data within and across environments. |
External Data Target | As used by Infoworks, an external data target is a cloud data warehouse, cloud data lake, or other external data management system to which Infoworks may export data, either via a transformation pipeline when intermediate transformation is required, or via sync-to-target when direct export of ingested data without transformation is required. |
Incremental Ingestion | As used by Infoworks, incremental ingestion is the process of moving data from an external data source to a storage within an environment, in batches scheduled to optimize overall system behavior, and to synchronize changing data sets across environments. |
Ingestion | As used by Infoworks, ingestion is the process of moving data from an external data source to storage within an environment, for further transformation. |
Ingestion Configuration | As used by Infoworks, an ingestion configuration is the result of a design process based on metadata crawled from an external data source. An ingestion configuration may be run as an ad hoc or scheduled job, and may be incorporated within a workflow either via UI or API. |
Job | As used by Infoworks, a job is a set of related tasks defined by 1) an ingestion configuration, whether for a single data table or table group, 2) a transformation pipeline, and/or 3) a Sync-to-target execution. Jobs are defined via UI or API, compiled to environment-specific code, then passed to compute available for a data engine within the relevant environment for execution. Jobs are commonly executed as part of a workflow. |
Metadata Crawl | As used by Infoworks, metadata crawl is a process that discovers schema, types, and other relevant metadata from an external data source, to support designing ingestion configurations. |
Migration | As used by Infoworks, migration is the process of moving data from an external data source, or from storage within an environment, to storage in a different environment, for the purpose of use or transformation by an external system. |
Onboard | As used by Infoworks, onboarding refers to ingesting data from an external data source to storage in a target environment. As a feature set, onboarding is focused on the ingestion configurations users design from information derived from metadata crawls, along with governance features for the ingested data. |
Operationalize | As used by Infoworks, operationalizing refers to the scheduling, logical flow, monitoring, execution, and re-execution of ingestion, transformation, and migration jobs. As a feature set, operationalizing is focused on the workflow orchestrations users design either through the Infoworks UI or API. |
Pipeline Extension | As used by Infoworks, a pipeline extension is a custom-designed transformation type made available for use by transformation pipelines within a domain to which the pipeline extension has been applied. |
Prepare | As used by Infoworks, preparing refers to automating the transformation, cleansing, enrichment, and modeling of data to support known and discovered use cases and processes. As a feature set, preparing is focused on the transformation pipelines users design to transform one or more data tables into one or more new data models. |
Source Configuration | As used by Infoworks, a source configuration is a named and configured reference to an external data management system. Through connections made based on this configuration, Infoworks executes a metadata crawl to derive information supporting the subsequent design of one or many ingestion jobs. |
Storage | As used by Infoworks, a storage is a physical location for holding binary data, configured for availability to a specific environment. An environment can have any number of related storages. Storage is assigned when designing ingestion, transformation, and migration jobs for an environment. |
Sync-to-target | As used by Infoworks, sync-to-target is a feature enabling the direct export of a data table or table group to an external data target, without use of an intermediate transformation pipeline. |
Table Group | As used by Infoworks, a table group is a named set of data tables to be ingested as a single job. |
Target | As used by Infoworks, a target is a terminal node, within a transformation pipeline, which names and defines a data model to which transformed data will be output. A transformation pipeline must have one, and may have many, targets. |
Transformation | As used by Infoworks, transformation refers to merging, cleansing, and enriching data through one or more transformation pipelines. The term may also be used more narrowly, to refer to a single node within a transformation pipeline as a transformation. |
Transformation Pipeline | As used by Infoworks, a transformation pipeline is an individual, visually defined sequence of interrelated transformations, drawing on one or more data tables to generate one or more data models each time the transformation pipeline is run as a job. When executed, a transformation pipeline is converted to SQL (and related code extensions) appropriate to the data engine for which it is designed (a sketch of such generated SQL follows this table). |
Workflow ("Workflow Orchestration," "Orchestration") | As used by Infoworks, a workflow is a set of ingestion, transformation, and migration processes, which are logically sequenced and scheduled by the workflow. A workflow may be defined visually or via REST API, and run either via UI or API. |
Workflow Run | As used by Infoworks, a workflow run is one execution of a coordinated set of workflow tasks which have been defined visually via UI as a workflow, or via REST API. |
Workflow Task | As used by Infoworks, a workflow task is a job, shell script, notification, or related behavior, coordinated as one part of a workflow. |
Workload | As used by Infoworks, the term workload abstractly describes a set of data ingestion, transformation, and migration processes, orchestrated as one or more jobs within one or more workflows, to collectively accomplish a specified data management goal. |
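
The Transformation Pipeline entry above notes that, when run, a pipeline is converted to SQL appropriate to its target data engine. The following is a minimal, hypothetical sketch of the kind of statement such a pipeline might generate; the table and column names (orders_src, customers_src, sales_model, and so on) are invented for illustration and are not Infoworks artifacts.

```sql
-- Hypothetical SQL a simple pipeline might compile to: join two ingested
-- data tables, cleanse one field, enrich with a derived column, and write
-- the result out as a new data model.
CREATE TABLE sales_model AS
SELECT
    o.order_id,
    TRIM(UPPER(c.customer_name)) AS customer_name,   -- cleanse node
    o.order_total,
    o.order_total * 0.08         AS estimated_tax    -- enrichment node
FROM orders_src AS o                                  -- source data table 1
JOIN customers_src AS c                               -- source data table 2
  ON o.customer_id = c.customer_id
WHERE o.order_total IS NOT NULL;                      -- filter node
```

In an actual deployment the generated statement would follow the dialect and storage conventions of the environment's data engine; this sketch uses generic ANSI-style SQL only to show the shape of the output.
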
Terms and Phrases | Definition |
---|---|
Analytics Agility | The ability to quickly and efficiently transform data to support newly discovered analytic requirements. |
Apache Hadoop | An open source technology implementing a MapReduce algorithm to divide and distribute segments of very large data sets across many nodes in a cluster for analysis, increasing total processing speed through parallelization, then stitching the distributed results into a final set. |
API (application programming interface) | An endpoint, or set of endpoints, exposed by one software system to support direct communication from another. Commonly, one or more values will be passed to an API, which will then respond with a behavior, and may pass back one or more values to the system that called the API. |
Automation | Optimizing human-involved processes by using configuration, not code, to implement and run best practice software solutions. |
AWS (Amazon Web Services) | The Amazon cloud service provider, which among many other services, hosts EMR, a distributed data lake style management system based on Apache Spark, and supported by Infoworks. AWS also supports Redshift, a distributed data warehouse style data management system. |
Azure | The Microsoft cloud service provider, which among many other services, hosts HDInsight, a distributed data lake style management system based on Apache Hadoop. Azure also supports Azure Databricks, a data lake style data management system supported by Infoworks, and Azure Synapse, a data warehouse style data management system. |
Best Practice | An industry term referring to a specific business process and/or software implementation or configuration which is widely accepted as an optimal approach to solving a recurring problem. |
Big Data | Data domains in which the total volume, variety, and velocity of data points to be ingested and analyzed exceeds the capacity of traditional monolithic RDBMS, and instead requires clustered computing solutions. |
Change Data Capture (CDC) | A set of design patterns by which the pre-change and post-change state of a record are captured, with timestamps and similar metadata, to enable robust logging, pause/restart, and similar features, as part of a data ingestion or export process. |
Closed Source Software | Closed source software delivers binary products compiled from source code, which itself is retained as confidential work product. The compiled binary code may be used in whatever manner is allowed by the license under which it is distributed and/or sold. In these ways, it is the opposite of open source software. |
Cloud Data Lake (CDL) | Software implementing a data lake architecture from within a cloud service provider. Migrating data from on-premise data sources to a cloud data lake or cloud data warehouse, or between these architectures, is a core function of Infoworks. |
Cloud Data Migration (CDM) | The automated processes by which on-premise data is selectively or comprehensively moved to cloud service providers, commonly into data lake and/or data warehouse architectures, depending on the structure and strategy needed to support intended workloads. Cloud Data Migration is a core Infoworks function. |
Cloud Data Warehouse (CDW) | Software implementing a data warehouse architecture from within a cloud service provider. Historically, most data warehouse software ran as an on-premise system. Cloud Data Migration, the migration of data from on-premise data management systems of all types, including data warehouses (e.g., Teradata) and distributed data processing systems (e.g., Hadoop), is a core Infoworks function. |
Cloud Service Provider (CSP) | A vendor offering computing infrastructure and software as a service (e.g., GCP, Azure, AWS, etc). |
Coarse Data | Data which has not been cleansed. |
Compute Provider | A predefined processing resource capable of being assigned to a job to run a workload. |
Configuration, not code | An industry phrase referring to a trend towards abstracting tasks, which previously required code, to interfaces and applications which need only to be provided relevant settings, then run, to generate the same outcomes which previously required designing, writing, and maintaining custom code. |
CSV (Comma Separated Value) File | A structured data file containing text written as rows and columns, to express tabular data, where each value in a row is separated by a comma. Commonly, but not necessarily, the first row lists column headers relating to values in the same ordinal position in subsequent rows. Common variants include tab-separated and pipe-separated values. CSV files may be ingested by Infoworks. |
Cube | A common data warehouse design pattern in which a fact table and a set of related dimension tables are designed, and all relevant measurements (aggregations) across all possible rows are pre-calculated for fast lookup. |
Data Architecture | The systems and processes available to deliver the storage and compute resources needed to manage ongoing workloads related to the structured data and unstructured data in an enterprise data ecosystem. |
Data Cleansing | The process of detecting, correcting, and potentially removing incomplete, inaccurate, incorrect, or irrelevant data from a table or database. Uncleansed data is commonly described as dirty or coarse. |
Data Compute | Compute resources within a data engine which are available for assignment to a workload. These data compute resources are typically persistent clusters or ephemeral clusters. |
Data Engine | A system designed to perform big data operations and analytics using clustered computing infrastructure. Infoworks abstractly automates data engines, ensuring long term flexibility for enterprises using these tools. Data engines generally fall into two broad families: those based on Apache Hadoop (e.g., Cloudera, Hortonworks, MapR, and HDInsight) and those based on Apache Spark (e.g., Databricks, GCP Dataproc, and AWS EMR); cloud data warehouse engines such as Snowflake and GCP BigQuery are also supported. |
Data Enrichment | A process by which fields and records may be modified to add additional data and/or structure during an ingestion or transformation process. |
Data Fabric | A technical product marketing term describing a data architecture that exposes, to end users, a searchable, self-service set of data services providing consistent capabilities to integrate, aggregate, prepare, govern, and orchestrate all data types, drawn from endpoints across a hybrid, multi-cloud environment. Infoworks helps enable data fabric architectures, which are themselves integrated from many systems. |
Data Governance | A set of processes by which data access, data change, and related regulatory compliance procedures, can be monitored and controlled. |
Data Ingestion | The process of transferring data from a source (e.g., RDBMS table, CSV file, API, etc.) to a data lake (e.g., one managed by Infoworks). The transfer process may be variously configured. For example, a small table with few changes over time may be fully loaded, and periodically replaced. By contrast, a large table may be incrementally ingested, and its ongoing updates captured by CDC. |
Data Lake | A repository of unstructured, structured, or semi-structured data, ingested from many sources, available for machine learning and similar use cases, and for transformation into structures needed to support workloads newly discovered via machine learning. A data lake may be on-premise, in the cloud, or hybrid, and may be part of a modern data architecture. |
Data Lakehouse | An environment in which data structured using a data warehouse architecture (for business reporting) is simultaneously available in an unstructured data lake architecture (for machine learning). Different vendors implement this architecture differently. For example, Cloud Data Warehouse vendor Snowflake can enable a "snowpark" of unstructured data, available for machine learning, as part of its onboarding process. And, Cloud Data Lake vendor Databricks can enable a "lakehouse" of structured data via a virtual data warehouse exposed over an underlying data lake. |
Data Mart | A subset of a data warehouse focused around a specific business domain, built to increase access speed for that domain's analytics. |
Data Operations | Often abbreviated as "DataOps," this industry term refers to an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics, by automating the flow and transformation of data from source to target. While DataOps began as a set of best practices, it has matured to become a new and independent approach to data analytics. |
Data Pipeline | An industry term referring to the comprehensive flow of data from a source to a target, including physical ingestion and transformation, as well as scheduling and latency operations related to these processes. In Infoworks, a transformation pipeline handles only the transformation aspects of this flow. Ingestion and workflow orchestration are handled by separate aspects of the product. |
Data Transformation | The process of merging, cleansing, and enriching data to support new use cases. Data transformation is commonly, but not necessarily or entirely, accomplished through the use of SQL. See transformation pipeline. |
Data Warehouse | A repository of structured data, ingested from many sources, and pre-transformed into structures needed to support known business intelligence workloads, and related analytics. It may be on-premise, in the cloud, or hybrid. A data warehouse may be part of a modern data architecture. Commonly, warehoused data will be denormalized to better support standardized analytical tasks. |
Data Warehouse Automation | The process of applying automation to data ingestion and transformation processes relating to a data warehouse. |
Data-Driven Organization | An organization which has built an enterprise data architecture to proactively gather, share, and mine data, towards supporting business decisions before specific needs have been identified. |
Dataproc | The GCP-managed distribution of Apache Spark and Hadoop. |
Denormalization | The process of enriching records with redundant data from other tables, forming a single composite record, eliminating the need to join related records to access composite data, and thus optimizing for greater access speed. See normalization, and the SQL sketch following this table. |
Digital Transformation | A change process by which automated data ingestion and data transformation workloads are implemented to integrate legacy data systems into a modern data architecture, creating an agile, comprehensive data platform, through which enterprise-wide data is pervasively gathered, measured, described, located, and consumed by business users, to drive business efficiencies. |
Dirty Data | Data which has not been cleansed. |
Distributed Compute | A compute architecture by which workloads are scattered among multiple individual processing nodes, then gathered back to assemble and return results, when workload processing is complete. This architecture enables the processing of data workloads too large for an individual computing node. |
EDO2 (Enterprise Data Operations and Orchestration) | A technology by which the onboarding, preparation, and operationalization of any number of data sources and targets may be automated, enabling the efficient implementation of a hybrid, multi-cloud data fabric. |
EDW Migration | See Cloud Data Migration. |
ELT (Extract, Load, Transform) | A data migration pattern by which applications read (extract) data from a data source (commonly OLTP systems), and insert (load) it directly to a data lake, where it may later be cleansed, enriched, targeted for machine learning, or have structure added to the data (transform). |
EMR | The AWS (Amazon Web Services) managed distribution of Apache Hadoop and Spark. |
Enterprise Cloud Bridge (ECB) | An on-premise Infoworks component that crawls metadata and ingests data into an Infoworks-managed cloud data lake or cloud data warehouse. |
Ephemeral Cluster | A set of virtual compute nodes, designed to run together as a cluster within a distributed data management system (e.g., Apache Spark, etc), which are instantiated ("spun up") based on a configuration template then, once running, are assigned to perform a specified job. When the job is complete, the ephemeral cluster is decommissioned ("spun down"), to minimize compute resource cost. By spinning down rather than idling, ephemeral clusters are more cost-efficient for certain job processing patterns than persistent clusters. |
ETL (Extract, Transform, Load) | A data migration pattern by which applications read (extract) data from a data source (commonly OLTP systems), denormalize, cleanse, and enrich the data (transform), then insert the transformed data (load) to a data warehouse. |
Expert Systems | Systems capturing the process flow of a particular problem domain, such that if particular factors are exhibited, then a particular decision should be made. They are fundamentally different from ML, in that they simply transcribe, more or less accurately, a human-generated decision-making model (heuristic). By contrast, ML develops its own heuristic by analyzing many examples and observing the relationships between them. |
Flume | A technology for ingesting streaming data into Hadoop. |
GCP (Google Cloud Platform) | The Google cloud service provider, which among many other services, hosts Dataproc, a distributed data lake style management system based on Apache Spark, and supported by Infoworks. It also supports BigQuery, a distributed data warehouse style data management system. |
Granularity | The level of specificity at which facts are recorded in a table (e.g., total_daily_sales vs. a sales_id for each product_id sold). Higher granularity may be more flexible, but also increases storage and analytical effort. |
Hadoop Migration | The process of moving existing workloads and data from on-premise Hadoop implementations to a cloud data warehouse or cloud data lake. |
HDFS | A technology for distributing data files across a cluster, relying on a central NameNode to manage access to potentially thousands of DataNode instances. |
Hive | A technology which provides an abstraction layer over HDFS data stores, by rewriting SQL (HiveQL) statements into corresponding Hadoop MapReduce (or Spark, Tez, etc.) jobs, executing those jobs, and returning a resulting dataset or response. Historically focused on high throughput ETL for Data Engineers. |
Hybrid, Multi-Cloud | A digital infrastructure in which on-premise compute infrastructure participates in common, secure networks with other systems and services exposed through one or more cloud service providers. |
IaaS (infrastructure as a service) | The provisioning of compute and storage resources by a cloud service provider. |
Impala | A technology, similar to Hive, providing an abstraction layer over HDFS data stores, scattering SQL statements for parallel processing on individual HDFS nodes, then gathering and assembling results. Historically focused on low latency interactive query, for analysts. Less fault tolerant than Hive. |
JDBC | An open standard Java API that manages connecting to a database, issuing queries and commands, and handling result sets obtained from the database. |
JSON (Javascript Object Notation) File | A structured data file containing text written in Javascript Object Notation, enabling the expression of more complex, hierarchical relationships than CSV notation. JSON files may be ingested by Infoworks. |
Kafka | An open source distributed event-streaming platform used to build data pipelines, streaming analytics, and data integration capabilities. |
Logical Data Warehouse | A recent industry term describing a logical (virtual) structure created over disparate sources, with pre-integrated analytical tools, governance, and unified metadata, to provide a "single version of truth" for data warehouse use cases, while underlying data remains available, for use cases better suited to a data lake or other architecture. A core design goal is "store once, share many." |
Machine Learning (ML) | The application of standard algorithms to prepared data sets in order to detect patterns and derive predictive insights supporting ongoing decision-making processes. There are six steps in the standard ML lifecycle. |
MapReduce | An open source technology for distributing Jobs and their Tasks across the nodes of an HDFS file system, for processing in a scatter-gather architecture. |
Merge | An ETL pattern in which a target table is incrementally updated by comparing source and target data, and either updating or inserting target rows from source data, as needed (a SQL sketch follows this table). |
Metadata | Data which describes other data, and falls into three broad categories: descriptive, administrative, and structural. Descriptive data enables discovery, identification, and selection (e.g., title, author, subject, keywords, etc). Administrative metadata enables management (e.g., permissions, timestamps, etc). Structural metadata identifies data types and relationships (e.g., column headers, data types, foreign keys, indexes, etc). |
Model | An algorithmic decision making framework which is trained by relevant data exposure so that, based on a given set of inputs (A), it is capable of predicting an output (B) within a targeted degree of statistical accuracy. |
Modern Data Architecture | An industry term describing systems integrated to expose both structured data and unstructured data to appropriate, corresponding workloads, typically deployed over hybrid, multi-cloud infrastructure. |
Normalization | The process of organizing data within related tables, to reduce redundancy, by placing records with logically common attributes in the same table, then defining dependencies from those records to other tables holding related records (e.g., a Customer table may be related to an Order table). This process tends to improve data consistency and reduce total storage cost, but can also reduce access speed, while increasing the cost and complexity of data retrieval. Successively more comprehensive degrees of normalization have been defined as Normal Forms (First Normal Form, Second Normal Form, etc). See denormalization. |
OLAP (Online Analytical Processing) | Umbrella term for applications loading and reading denormalized facts and their related dimensions for analytical use. OLAP data is commonly stored in 2NF relational tables, OLAP cubes, Hadoop, etc. |
OLTP (Online Transactional Processing) | Umbrella term for applications creating, reading, updating, and deleting business-critical facts. OLTP data is commonly stored in 3NF relational tables. |
On-Premise | A term commonly referring to compute infrastructure directly owned and maintained by its end user. |
Open Source Software (OSS) | A term which refers to how software is licensed and distributed. Open source software provides access to uncompiled source code, and may be used, re-used, modified, and recompiled without cost, provided the user complies with the terms of its license. In this way, it is the opposite of closed source software. |
Orchestration of Data Operations | The process of coordinating, scheduling, monitoring, and controlling the ongoing state of a set of data ingestion, transformation, optimization, cataloging, and export operations. |
Parallelization of Data Loads | A technique by which data ingestion or export is configured to run via two or more connections, to increase total throughput speed. |
Parquet | A column oriented data storage format widely used in the Hadoop ecosystem. |
Persistent Cluster | A set of virtual compute nodes, designed to run together as a cluster within a distributed data management system (e.g., Apache Spark, etc), which are maintained in a running state to be available for assignment to specified jobs. When the job is complete, the persistent cluster remains available for the next job. By remaining available, and avoiding spin-up and spin-down times, persistent clusters are more cost-efficient for certain job processing patterns than ephemeral clusters. |
Pig | A technology and high level data manipulation syntax (PigLatin) for generating Hadoop MapReduce jobs, to reduce reliance on lower level Java coding. |
Pipeline | The full sequence of data migration related activity, from ingestion, through transformation, to the scheduling and orchestration of these operations. See transformation pipeline. |
Platform | A computing environment providing a comprehensive set of tools for defining and executing processes specific to the business, content, or technical domain to which the platform applies (e.g., personal computer operating system, cloud service provider, RDBMS, storage network, etc). Commonly, platforms overlap and interrelate in their capabilities. |
RDBMS (relational database management system) | A technology enabling the use of SQL for defining, creating, reading, updating, and deleting records from tables, while expressing relationships among these tables to support querying for rows comprised of data which resides, in parts, among multiple tables in the database. Many additional user, process, access latency, and security related features are commonly packaged as part of an RDBMS product. |
SaaS (software as a service) | Software available from a cloud service provider, generally billed by compute resources used. It may be more efficient than on-premise software, in terms of total cost of ownership. |
Slowly Changing Dimensions | Dimensions related to facts that change over time without following any inherent pattern (e.g., product_price). There are six established patterns (SCD 0 through SCD 6, defined below) by which SCD may be tracked within a Data Warehouse. |
Solution | As used in the software industry, this term commonly refers to a set of tools which have been integrated to solve problems in a specific business domain, using automation and configuration, not code. The differences between tools and solutions can be described along a spectrum. |
Source Type Extension | A process by which data types assigned to values in a data source may be converted to a corresponding data type in a target data destination, with no loss of precision. This technique can optimize data access from the target by streamlining diverse source data types to common types in the target. |
Speed To Value | A phrase describing a primary value proposition of the automation provided by systems like Infoworks. Cloud Data Migration adds significant value to enterprise data operations, and Infoworks speeds the process of achieving this value. |
SQL (Structured Query Language) | A classic database technology, expressed as a language, with both standard forms and many vendor specific variants, for defining, populating, and querying the contents of relational data structures. In Infoworks, transformation pipelines provide a high level abstraction for generating complex SQL statements, for many different environments, integrated with and supported by related technologies. |
Sqoop | An open source technology for ingesting RDBMS data into Hadoop. |
Structured Data | Data sets for which one or more records share a common, well organized format, often but not necessarily described by external metadata (e.g., an RDBMS table, CSV or JSON file). |
Tool | As used in the software industry, this term commonly refers to a software implementation which is highly focused to solve a relatively granular problem. Commonly, a tool will be designed to expose an API enabling the tool to be incorporated as part of a wider solution. The differences between tools and solutions can be described along a spectrum. |
Truncate and Load | An ETL pattern in which the target table is truncated (DDL based fast delete of all contents) and fully re-loaded with source data. |
Unstructured Data | Data sets for which no readily discernible common format is available without complex introspection (e.g., a folder of various business documents, or of log files generated by disparate systems). |
Workflow Orchestration | The process of optimizing related ingestion and transformation workloads by coordinating the runtimes and resources assigned to their underlying jobs. |
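
The Merge and Change Data Capture entries above describe incrementally updating a target table by comparing captured source changes against existing target rows. The following is a minimal sketch using the ANSI SQL MERGE statement; the table and column names (dim_customer, staged_customer_changes, and so on) are hypothetical and chosen only for illustration.

```sql
-- Hypothetical incremental merge: apply staged source changes (for example,
-- rows captured via CDC) to a target table, updating matches and inserting
-- new rows.
MERGE INTO dim_customer AS t
USING staged_customer_changes AS s
   ON t.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET customer_name = s.customer_name,
               updated_at    = s.change_ts
WHEN NOT MATCHED THEN
    INSERT (customer_id, customer_name, updated_at)
    VALUES (s.customer_id, s.customer_name, s.change_ts);
```

By contrast, the Truncate and Load pattern described above would simply empty the target and reload it in full, which is often simpler but more expensive for large, slowly changing tables.
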
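The Normalization and Denormalization entries above contrast storing related records in separate tables with folding them into one composite record for faster reads. A minimal hypothetical illustration of denormalization, using invented orders and customers tables:

```sql
-- Hypothetical denormalization: copy customer attributes onto each order row
-- so downstream reports can read one wide table with no join at query time.
CREATE TABLE orders_denorm AS
SELECT
    o.order_id,
    o.order_date,
    o.order_total,
    c.customer_name,     -- redundant copies of customer attributes
    c.customer_region
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id;
```

The trade-off is the one noted in the Normalization entry: the denormalized copy is faster to read but stores redundant data and must be refreshed when the source tables change.
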
Terms | Definition |
---|---|
Data Definition Language (DDL) | The aspect of SQL which applies to defining, creating, and modifying database objects, including tables and their data types, indexes, and users. |
Dimension Table | A table comprised of attributes related to a fact, providing the who, what, when, where, why, how, and how many by which those related facts may be measured. Can be visualized as a star of dimension fields related to a particular fact field. Where dimension fields may be shared by multiple facts, this relationship can be visualized as a constellation. As a generalization, quantitative (numeric) fields tend to be facts, while qualitative (text) fields tend to be dimensions. |
Endpoint | The physical address of a service, commonly expressed as a URL. |
Fact Table | A table comprised of granular data points to be analyzed by a set of related dimensions (who, what, when, where, why, how, how many). The granularity of this table is the level at which the underlying business fact occurs (e.g., one sale of a specific product.) Commonly comprised of a specific data point (aka, fact or measure) (e.g., units_sold), and a dimension key (e.g., product_id) by which that measure can be related to its various dimensions of measurement. As a generalization, quantitative (numeric) fields tend to be facts (aka, metrics or measurements), while qualitative (text) fields tend to be dimensions (aka, attributes or descriptions.) |
Field | One data element in a record. |
Hive Bucketing | A technique by which records within any given partition (e.g., State within Orders, City within State within Orders, or even just Orders) are stored in one of a specified number of data files, called buckets, rather than in a single data file for each partition. The destination bucket for each record stored in the partition is determined by a hash of its key. Like partitioning, this technique increases the potential for parallel rather than sequential table scans, and it may be implemented either instead of, or in addition to, partitioning. For example, an Orders table clustered into 3 buckets is stored as /Orders/part-0000, part-0001, part-0002, and so on; the same table clustered into 3 buckets and also partitioned by State is stored as /Orders/Alaska/part-0000, part-0001, part-0002, and so on. |
Hive Partitioning | A technique by which data is physically organized into folders, called partitions, based on values in a designated partition column. For example, an Orders table could be partitioned by State values; a subsequent query filtering results by State then loads and processes only records from the matching State folder. Partition column cardinality matters here: high cardinality results in a larger number of smaller partitions, and too many partitions in a cluster with too little parallel capacity to process them efficiently will impede, not improve, performance. Partitions can also be hierarchically nested. For example, Orders could be partitioned by City within State (or perhaps by Year, Month, Day), so that a query filtering by both City and State loads an even smaller and more efficient set of files into memory to process the rest of the query. This technique increases the potential for parallel rather than sequential table scans, producing a folder layout such as /Orders/Alaska/Anchorage, /Orders/Alaska/Fairbanks, /Orders/Arkansas, and so on (a HiveQL sketch follows this table). |
ML Algorithms (Supervised Learning) | Linear Regression: used to predict numeric values; Logistic Regression: used to predict binary outcomes; Decision Tree: used for categorization problems; Random Forest: improved decision tree; Naïve Bayes: prediction from independent predictors. |
ML Categories | (1) Supervised: given A, predict B. The machine is trained to identify a type of data sharing a common label (e.g., face) by analyzing a volume of data points that both match (e.g., a face) and do not match (e.g., not a face) the target label. The distinguishing feature of this category is that the desired output (e.g., faces) is known in advance. (2) Reinforcement: the machine is trained to classify data points based on their interaction with a common environmental model, wherein successful interactions are rewarded, and the model is updated to prefer data points exhibiting the features of previously successful data points. For example, if a red apple were found successful based on its characteristics (color, shape), a green apple would be more likely to be flagged as an apple based on its shape, despite the difference in color. (3) Unsupervised: the machine is trained to classify data points according to common measures detected during the learning process. For example, fruit might be classified by its color, size, surface texture, prevalence within the set, etc. (4) Transfer: a model trained on one use case (e.g., sort cats from dogs) is applied to a new use case (e.g., sort fruits from vegetables). |
ML Cost Benefit | Do the benefits of the repeated decisions made based on predictive insights outweigh the cost of acquiring data and training the required models? ML is most valuable to ongoing processes and their decisions, and much less useful with occasional or ad hoc decisions (because they provide too little data to analyze.) |
ML Data Cleanliness | The degree to which average data inputs within a set are properly formatted, and are accurate relative to the question to be analyzed (e.g., do all inputs record temperature data by the same measure: F or C?). |
ML Data Completeness | The extent to which all factors necessary for analysis are represented within the target data set (e.g., if you are analyzing for product defects, and heat can cause defects, do your inputs include temperature data? Do you have the right and sufficient columns to describe the problem?). |
ML Data Coverage | The degree to which all possible variant states to be analyzed are represented within a data set (e.g., if you are analyzing for product defects caused by heat, motion, light, and impact, does the target set include multiple examples of defective states caused by each of these factors? Do your rows of data cover all the possible variants?). |
ML Example | A single item comprised of one or more features, which are distinct data elements describing the example, and a label, which is the definitive numeric value for, or text classification of, this example. ML Models are used to predict the label for a proposed example based on patterns learned from the examples provided during training. Features are descriptions that allow algorithms to figure out a label. Labels are the values to be predicted. |
ML Libraries (Python) | Matplotlib: generates charts from data; Seaborn: visualization of the distribution within a statistical model (e.g., heat maps); Scikit-Learn: mining and analysis tool extending NumPy and Matplotlib; Pandas: data wrangling tool enabling manipulation, aggregation, and some visualization; NumPy: operations on numeric arrays and matrices. |
Record | A set of related fields. See row. |
Row | A set of related fields (attributes) expressed as one element within a data table. |
SCD 0 | Changes to source records are not tracked in the Data Warehouse. Applied when source dimensions are no longer relevant to the purpose of the DW. |
SCD 1 | Changes to source records are propagated to the Data Warehouse, but no history is retained about the change. Applied when only the current dimension will ever be relevant to the facts. |
SCD 2 | Changes to source records are propagated to the Data Warehouse as new rows, with a timestamp indicating the effective date of the add/change. Any further changes to this record cause another new row to be added for the same row key, with updated fields and an updated effective date timestamp. A current/active record field may also be applied to such records, to assist with indexing and filtering for current records, though currency may also be calculated based on date values. Applied when historic data may be relevant to future analyses (a SQL sketch follows this table). |
SCD 3 | Changes to source records are propagated to the Data Warehouse by adding new columns to an existing record (e.g., an address field and a previous_address field), along with an effective date field for the current state of those fields. Rarely used, because only one change (from previous to current) is tracked, and because it complicates coding and reporting by requiring the use of alternative field names when accessing historical data. |
SCD 4 | Changes to source records are propagated to the Data Warehouse, by updating the current record to new values, with an effective date timestamp, and also migrating the previous values to a related history table, which includes both start and end date timestamps for this record. May be used as an alternative to SCD2, if there is a need to control the size of the primary tables (e.g., high volume transaction tables, where there could be hundreds of updates per day / hour / minute.) |
SCD 6 (1+2+3) | Changes to source records are propagated to the Data Warehouse by adding a new record to the existing table which includes both current and previous columns for relevant fields, as well as effective date timestamps for changes, along with a current flag. Not commonly used, as it provides redundant ways to access historical data: via the previous value fields, via the effective date timestamps, or via the current flags. |
Table | A set of one or more records arranged in rows (records) and columns (fields, attributes). |
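
The Hive Partitioning and Hive Bucketing entries above describe physically organizing a table's data into partition folders and, within each partition, a fixed number of hashed bucket files. The following is a minimal HiveQL sketch; the orders table and its columns are hypothetical and used only for illustration.

```sql
-- Hypothetical HiveQL: partition orders by state, and hash each partition's
-- rows into 3 bucket files keyed on order_id.
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_total DECIMAL(10,2)
)
PARTITIONED BY (state STRING)
CLUSTERED BY (order_id) INTO 3 BUCKETS
STORED AS PARQUET;

-- A query filtering on the partition column reads only the matching folder,
-- e.g., /orders/state=Alaska, rather than scanning every partition.
SELECT COUNT(*) FROM orders WHERE state = 'Alaska';
```

Note that Hive lays partitions out as `state=Alaska` style folders; the simplified /Orders/Alaska paths in the definitions above convey the same idea.
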
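The SCD 2 entry above describes adding a new row, with an effective date and a current flag, each time a tracked dimension changes. A minimal sketch in generic SQL, with a hypothetical dim_customer table and an invented change to customer 42's address:

```sql
-- Hypothetical SCD 2 handling for a changed customer address:
-- close out the currently active row, then insert the new version.
UPDATE dim_customer
   SET is_current    = FALSE,
       effective_end = DATE '2024-06-01'
 WHERE customer_id = 42
   AND is_current  = TRUE;

INSERT INTO dim_customer
    (customer_id, address,         effective_start,   effective_end, is_current)
VALUES
    (42,          '12 New Street', DATE '2024-06-01', NULL,          TRUE);
```

Both versions of the record now coexist: historical queries filter by the effective date range, while current-state queries filter on is_current.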