HHPC 10061
Verify that this is the correct version before use.
Information Management Platform for
Data Analytics and Aggregation (IMPALA)
System Design Document
Human Health and Performance Directorate
Human Systems Engineering and Development Division
Configuration Controlled by the HHPD TRB
Verify that this is the correct version before use.
August 2016
Baseline
National Aeronautics and Space Administration
Lyndon B. Johnson Space Center
Houston, Texas 77058
CONTRACTOR APPROVAL SHEET
Information Management Platform for
Data Analytics and Aggregation (IMPALA)
System Design Document
Contract Number: T73062
PREPARED BY: //original signature on file// 08/16/2016
Akinyele Akinyelu DATE
LM Senior Architect
APPROVED: //original signature on file// 08/17/2016
Alan Ruter DATE
LM Project Manager
HHPC
APPROVED: //original signature on file// 08/22/2016
Ram Pisipati DATE
Project Integrator
HHPC
APPROVED: //original signature on file// 08/23/2016
Keith Kreutzberg DATE
Wyle Technical Representative
HHPC
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
Lyndon B. Johnson Space Center
Houston, Texas
NASA APPROVAL SHEET
Information Management Platform for
Data Analytics and Aggregation (IMPALA)
System Design Document
APPROVED: //original signature on file// 08/30/2016
Andrew Carnell DATE
Enterprise Architect/SF5
NASA
APPROVED: //original signature on file// 08/30/2016
Kathy Johnson-Throop DATE
Information Systems Architecture Branch Chief/SF5
NASA
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
Lyndon B. Johnson Space Center
Houston, Texas
CHANGE HISTORY
REVISION/CHANGE    DATE         AUTHORIZATION    DESCRIPTION OF CHANGE
Baseline           8/30/2016    IMPALA TRB       Baseline Release
TABLE OF CONTENTS
1.0 INTRODUCTION ...................................................................................................................................... 1-1
1.1 PURPOSE OF THE SYSTEM DESIGN DOCUMENT ............................................................................................. 1-1
2.0 GENERAL OVERVIEW AND DESIGN GUIDELINES/APPROACH ................................................................... 2-1
2.1 GENERAL OVERVIEW ................................................................................................................................... 2-1
2.2 ASSUMPTIONS/CONSTRAINTS/RISKS ............................................................................................................ 2-2
2.2.1 Assumptions ......................................................................................................................................... 2-2
2.2.2 Constraints ........................................................................................................................................... 2-3
3.0 DESIGN CONSIDERATIONS ...................................................................................................................... 3-1
3.1 GOALS ....................................................................................................................................................... 3-1
3.2 DEPENDENCIES ........................................................................................................................................... 3-1
4.0 OVERALL SYSTEM ARCHITECTURE ........................................................................................................... 4-1
4.1 LOGICAL ARCHITECTURE.............................................................................................................................. 4-2
4.1.1 Capture ................................................................................................................................................ 4-3
4.1.2 Transport.............................................................................................................................................. 4-4
4.1.3 Refine ................................................................................................................................................... 4-4
4.1.4 Store ..................................................................................................................................................... 4-5
4.1.5 Analyze ................................................................................................................................................. 4-5
4.1.6 Distribute ............................................................................................................................................. 4-6
4.1.7 Manage ................................................................................................................................................ 4-6
4.2 FUNCTIONAL ARCHITECTURE ....................................................................................................................... 4-7
4.3 INFRASTRUCTURE ARCHITECTURE .............................................................................................................. 4-10
4.4 NETWORK ARCHITECTURE ......................................................................................................................... 4-12
4.5 SECURITY ARCHITECTURE .......................................................................................................................... 4-14
5.0 DATA GOVERNANCE FRAMEWORK ......................................................................................................... 5-1
6.0 OPERATIONAL SCENARIOS...................................................................................................................... 6-1
6.1 END USER USAGE SCENARIOS ...................................................................................................................... 6-1
6.1.1 Data Ingestion...................................................................................................................................... 6-1
6.1.2 Profiling and Cataloging Data.............................................................................................................. 6-3
6.1.3 Manually Edit Data .............................................................................................................................. 6-4
6.1.4 Create, Manage and Share Data sets .................................................................................................. 6-7
6.2 ADMINISTRATOR USAGE SCENARIOS ............................................................................................................ 6-7
6.2.1 User Provisioning ................................................................................................................................. 6-7
6.2.2 System Monitoring ............................................................................................................................... 6-8
6.2.2.1 Log Aggregation ........................................................................................................................................... 6-8
6.2.2.2 Health and Availability Monitoring .............................................................................................................. 6-9
6.2.3 System Patching and Updates ........................................................................................................... 6-10
6.2.4 Backup & Recovery ............................................................................................................................ 6-10
6.2.4.1 Data Backup Schedule ................................................................................................................................ 6-10
6.2.4.2 Virtual Machine (VM) Backup .................................................................................................................... 6-12
7.0 DETAILED DESIGN ................................................................................................................................... 7-1
7.1 INFRASTRUCTURE DETAILED DESIGN ............................................................................................................ 7-1
7.1.1 Application Nodes ................................................................................................................................ 7-2
7.1.2 Enterprise Data Hub ............................................................................................................................. 7-4
7.1.2.1 Master Nodes ............................................................................................................................................... 7-5
7.1.2.2 Data Nodes ................................................................................................................................................... 7-6
7.1.3 Storage Node ....................................................................................................................................... 7-7
7.1.4 System Software Specifications ........................................................................................................... 7-8
7.1.5 Power consumption ............................................................................................................................. 7-8
7.2 SECURITY DETAILED DESIGN ........................................................................................................................ 7-9
7.2.1 Perimeter Security .............................................................................................................................. 7-11
7.2.2 IMPALA Authentication Module ........................................................................................................ 7-11
7.2.3 IMPALA RBAC Module ........................................................................................................................ 7-14
7.2.4 IMPALA Encryption Module ............................................................................................................... 7-19
7.2.5 IMPALA Auditing ................................................................................................................................ 7-21
7.2.5.1 Navigator Audit .......................................................................................................................................... 7-21
7.2.5.2 Navigator Lineage ...................................................................................................................................... 7-22
7.2.5.3 Waterline ................................................................................................................................................... 7-22
7.2.5.4 Server Security Auditing ............................................................................................................................. 7-22
7.2.5.5 Anti-Virus Monitoring ................................................................................................................................ 7-22
8.0 INTERFACES ............................................................................................................................................ 8-1
8.1 INBOUND INTERFACES ................................................................................................................................. 8-1
8.2 OUTBOUND INTERFACES ............................................................................................................................. 8-1
APPENDIX A ACRONYMS AND ABBREVIATIONS ............................................................................................. A-1
LIST OF TABLES
TABLE 4.1-1 MAPPING OVERVIEW: USER ROLES TO IMPALA PLATFORM LAYERS ...................................................... 4-3
TABLE 4.5-1 SECURITY FUNCTION TO TOOL MAPPING ......................................................................................... 4-17
TABLE 6.1-1 SAMPLE USAGE SCENARIOS MAPPED TO IMPALA ROLES AND LAYERS ................................................... 6-1
TABLE 7.1.1-1 IMPALA VIRTUAL INFRASTRUCTURE SOFTWARE AND FUNCTIONS ....................................................... 7-2
TABLE 7.1.1-2 IMPALA APPLICATIONS AND FUNCTIONS .......................................................................................... 7-2
TABLE 7.1.1-3 IMPALA APPLICATION NODE CONFIGURATION .................................................................................. 7-4
TABLE 7.1.2.1-1 IMPALA MASTER NODE CONFIGURATION ........................................................................................ 7-5
TABLE 7.1.2.2-1 IMPALA DATA NODE CONFIGURATION ............................................................................................ 7-7
TABLE 7.1.3-1 IMPALA STORAGE NODE CONFIGURATION ...................................................................................... 7-7
TABLE 7.1.4-1 SERVER-SIDE OPERATING SYSTEMS AND COUNT ................................................................................. 7-8
TABLE 7.1.5-1 POWER CONSUMPTION ................................................................................................................... 7-8
TABLE 7.1.5-2 POWER CONSUMPTION (CONTINUED) .............................................................................................. 7-9
TABLE 7.2-1 PROTECTION POINT TO IMPALA MODULE MAPPING ......................................................................... 7-9
TABLE 7.2.2-1 IMPALA AUTHENTICATION SECURITY MODULE TERMINOLOGIES ...................................................... 7-13
TABLE 7.2.3-1 IMPALA COMPONENT ROLES AND PRIVILEGES ................................................................................ 7-14
TABLE 7.2.3-2 IMPALA SENTRY COMPONENTS AND FUNCTIONS ............................................................................ 7-17
TABLE 7.2.3-3 RECORD SERVICE COMPONENTS AND FUNCTIONS .......................................................................... 7-18
TABLE 7.2.4-1 IMPALA ENCRYPTION COMPONENTS AND FUNCTIONS ..................................................................... 7-20
LIST OF FIGURES
FIGURE 4.1-1 IMPALA KEY ENABLERS AND PLATFORM LOGICAL ARCHITECTURE ................................................... 4-2
FIGURE 4.2-1 LAYERED FUNCTIONAL VIEW OF IMPALA PLATFORM ..................................................................... 4-7
FIGURE 4.3-1 OVERVIEW OF INFRASTRUCTURE ARCHITECTURE ......................................................................... 4-10
FIGURE 4.4-1 NETWORK ARCHITECTURE OVERVIEW ......................................................................................... 4-13
FIGURE 4.5-1 IMPALA SECURITY OVERVIEW .................................................................................................... 4-15
FIGURE 4.5-2 USER AUTHENTICATION PROCESS OVERVIEW ............................................................................... 4-16
FIGURE 6.1.1-1 DATA INGESTION FLOW ................................................................................................................ 6-2
FIGURE 6.1.2-1 DATA CATALOGING FLOW ............................................................................................................ 6-3
FIGURE 6.1.3-1 PERSISTENT MANUAL EDIT OF GENERATED DATA SET .................................................................... 6-5
FIGURE 6.1.3-2 IN-PLACE ROW/CELL MANUAL EDITS OF GENERATED DATASET ........................................................ 6-6
FIGURE 6.2.1-1 IMPALA USER PROVISIONING PROCESS ......................................................................................... 6-8
FIGURE 6.2.2.1-1 IMPALA LOG AGGREGATION PROCESS .......................................................................................... 6-9
FIGURE 6.2.4.1-1 IMPALA BACKUP SCHEDULE SHORT TERM ................................................................................ 6-11
FIGURE 7.1-1 IMPALA PLATFORM AND HARDWARE POWER DETAILS .................................................................. 7-1
FIGURE 7.1.2-1 ENTERPRISE DATA HUB SERVICE LAYOUT ....................................................................................... 7-5
FIGURE 7.2-1 IMPALA SECURITY OVERVIEW .................................................................................................... 7-10
FIGURE 7.2.1-1 PERIMETER SECURITY ................................................................................................................. 7-11
FIGURE 7.2.2-1 IMPALA AUTHENTICATION FLOW DIAGRAM ................................................................................ 7-12
FIGURE 7.2.3-1 IMPALA AUTHORIZATION FLOW DIAGRAM ................................................................................. 7-16
FIGURE 7.2.4-1 IMPALA ENCRYPTION FLOW DIAGRAM ....................................................................................... 7-19
FIGURE 7.2.4-2 INTEGRATION BETWEEN THE IMPALA DATA RESERVOIR AND
THE IMPALA ENCRYPTION COMPONENTS .................................................................................... 7-21
1.0 INTRODUCTION
The System Design Document (SDD) is a compendium of three documents, providing a single source for
requirements, system design, and data design. The functional and non-functional requirements are
drawn from the Information Management Platform for Data Analytics and Aggregation (IMPALA)
System Requirements document. The three elements of requirements, user design, and data design
form the baseline from which to build a set of more technical system design specifications for the final
product, providing both high-level system design and low-level detailed design.
NOTE: For the remainder of this document, the Information Management Platform for Data
Analytics and Aggregation (IMPALA) is referred to as the IMPALA Platform.
The SDD delineates design goals and considerations, provides a high-level overview of the system
architecture, and describes the system data design, the human-machine interfaces, and operational
scenarios. The high-level system design is decomposed into low-level detailed design specifications for
each system component, including hardware, internal communications, software, system integrity
controls, and interfaces.
1.1 PURPOSE OF THE SYSTEM DESIGN DOCUMENT
The System Design document tracks the design activities that are performed to guide the integration,
installation, verification, and acceptance testing of the IMPALA Platform. The inputs to the design
document are derived from the activities recorded in Tasks 1 through 6 of the Statement of Work
(SOW), with the proposed technical solution being the completion of Phase 1-A. With the
documentation of the architecture of the IMPALA Platform and the installation steps taken, the SDD will
be a living document, capturing the details about capability enhancements and system improvements
to the IMPALA Platform that support users in developing accurate and precise analytical models. The
IMPALA Platform infrastructure team, data architecture team, system integration team, security
management team, project manager, NASA data scientists and users are the intended audience of this
document.
The IMPALA Platform is an assembly of commercial-off-the-shelf (COTS) products installed on an
Apache-Hadoop platform. User interface details for the COTS products will be sourced from the COTS
tools vendor documentation. The SDD is a focused explanation of the inputs, design steps, and
projected outcomes of every design activity for the IMPALA Platform through installation and
validation.
2.0 GENERAL OVERVIEW AND DESIGN GUIDELINES/APPROACH
This section describes the principles and strategies used as guidelines in the design of and
implementation of the IMPALA Platform.
2.1 GENERAL OVERVIEW
Wyle Science, Technology, and Engineering Group (Wyle) is the prime contractor of NASA’s Human
Health and Performance Contract (HHPC), providing engineering, clinical, occupational health
surveillance, and flight hardware support to NASA’s Human Health and Performance Directorate
(HH&P). To meet its mission, HH&P collects, analyzes, and generates reports from a plethora of data in
support of crewmember occupational health surveillance and epidemiologic investigation activities,
laboratory testing, crew safety evaluations, scenario modeling, intelligence and evidence gathering,
environmental assessments, and medical countermeasures deployment.
While Wyle is currently able to meet its NASA customer's data integration needs and mission
objectives with the current system and processes, NASA faces a number of process and technical
challenges:
• Inefficient, manual processes to gain access to internal and external data
• Legacy system challenges that prevent/limit automated integration of data from multiple
sources and a comprehensive view (and reuse) of laboratory, crew health records,
environmental and epidemiological data
• Lack of common processes, advanced analytics capability, and formats and controls needed to
effectively analyze, visualize, and share/reuse data across multiple systems and users
• Labor-intensive data request fulfillment processes that depend on manual data aggregation and
manual quality control reviews
• Limited or inefficient data visualization, management, governance and data sharing capabilities
(tools, skills and processes)
• Insufficient controls to store, process, and analyze longitudinal crewmember health data with
changes to metrics, data types, and units over time
• Inability to apply descriptive, predictive, and prescriptive analytics tools across an array of
constantly changing data sets
• Inability to capture and retain meta-data on analysis and use of data, in principle, by any and all
users
A more effective method to manage astronaut health data and increase personnel productivity must be
developed. To meet the above challenges and needs, Wyle seeks to implement a data analytics
platform that will electronically integrate, manage, analyze, visualize, report on and create/share
datasets of surveillance, epidemiologic, laboratory, environmental and other data in an efficient, cost-
effective, and timely manner.
The envisioned IMPALA Platform will enable Wyle to standardize a core set of processes, metrics, and
capabilities across multiple monitoring and surveillance activities, human space flight-related research
programs, and data requests in a more efficient and consistent manner, as well as integrate new and
existing data types and unstructured data.
The goal of this project is to plan, design, build, test, and deploy an extensible, flexible, and modular
data integration, management, collaboration, analysis, and visualization platform in support of NASA
HH&P’s occupational health monitoring and surveillance activities, and human space flight-related
research activities.
2.2 ASSUMPTIONS/CONSTRAINTS/RISKS
2.2.1 Assumptions
The IMPALA Platform infrastructure is built on both hardware and software technologies that are
influenced by industry standards. Some of these standards are in a state of evolution. To promote the
portability of software applications and to reduce overall infrastructure costs, industry standards will be
used to implement functions where they are deemed appropriate.
The following items are design assumptions for IMPALA’s initial implementation.
1. The MEME network team will provide cabling between the IMPALA rack and the Mission
Extended Medical Enterprise (MEME) network.
2. End-user laptops and/or workstations used to access the IMPALA platform will be provided
by NASA JSC and are not part of the IMPALA platform.
3. Upstream connectivity to the MEME network will be provided by the MEME network team.
4. The MEME network and security team will provide security and system health monitoring tools
to ensure the health and security posture of the platform.
5. The IMPALA platform will leverage the existing MEME network virtual center metadata store,
which uses an external SQL database.
6. Infrastructure services such as the Domain Name System (DNS) and Active Directory (AD) are not part of
the IMPALA platform but are made available from the Information Resource Directorate (IRD)
through the Network Access Control Board (NACB).
7. The IMPALA platform will not automatically patch or update applications or operating systems.
The patching and updating schedule is dependent on the MEME environment.
8. Routable addresses used for end user communication are supplied by the MEME network team
and not a part of the IMPALA platform design.
9. The MEME network team will provide non-routable addresses for internal (intra-IMPALA)
rack communications, covering all required IMPALA systems.
10. Power consumption for each IMPALA rack is not to exceed 5000 watts.
11. The IMPALA platform websites and/or web services will be registered in the System for Tracking
and Registering Applications and Websites (STRAW) by the HH&P IT Security team.
2.2.2 Constraints
1. The IMPALA platform will only be accessible through NASA networks via approved, dedicated
connections or Virtual Private Network (VPN) [e.g., the MEME Secure Sockets Layer (SSL) VPN
appliance].
2. The NASA/Wyle Technical Review Board (TRB) must approve all software and hardware
technologies and products selected by the Contractor before implementation.
3. The IMPALA Platform constraints may be influenced by NASA JSC Security center policies,
procedures and protocols.
3.0 DESIGN CONSIDERATIONS
3.1 GOALS
The goals of the IMPALA system are to:
• Increase accessibility to accurate and actionable data without compromising security
• Increase confidence in data through defined data governance processes and controls
• Enable users to improve the quality of data analysis, reports, recommendations, and decisions
substantiated by data stored in the IMPALA Platform
• Seamlessly ingest, integrate, and manage clinical, life sciences, epidemiological, environmental,
laboratory and astronauts' longitudinal health monitoring and surveillance data
• Provide automated processes to extract, cleanse, validate, transform and curate data in
multiple formats from a variety of sources based on analyst-defined rules and controls
• Extract and store metadata and establish relationships between known entities and fields in the
source data
• Provide a highly scalable and available infrastructure to store, process and analyze data as well
as continuously refresh the data from the source data
• Facilitate stakeholder collaboration and real-time data sharing internally and externally with
trusted partners, universities and other government agencies to facilitate collateral exploratory
analyses and hypothesis testing by trusted outsiders and for more extensive peer review
• Provide a secure single point of access to the data in the platform for all approved users across
multiple end user devices (mobile, desktop, laptop and tablet) with appropriate platform-specific
security protocols
• Comply with NASA's security requirements, be certified to operate within the NASA
environment, and comply with federal requirements for security and handling of medical PII
• Interoperate with existing NASA and HHPD systems
• Analyze and visualize data from multiple data sources in a single unified view, including data
from external investigators and programs
• Correlate, aggregate, create and share datasets from multiple data sources for analysis by
internal and external scientists, as well as reuse these datasets as sources for future data
requests
• Provide advanced analytics capabilities to perform data mining, data exploration and discovery
that retain meta-data on analyses and uses of the data by all users
• Support timely processing and provisioning of new data sources for platform users
3.2 DEPENDENCIES
Implementing the IMPALA system is dependent upon the following:
• The MEME Infrastructure team will provide the tools for monitoring the health of the IMPALA
platform.
• The HHPIT security team will provide tools, such as the CIS benchmarking software, for hardening
the IMPALA systems.
• The HHPIT security team will provide security tools, such as log aggregation agents, and
processes for auditing the IMPALA platform.
• The MEME Infrastructure team will provide the antivirus package and license instructions for
agents installed on the IMPALA servers.
• The IMPALA virtual center database will be backed up by the MEME Infrastructure team post
deployment of the platform.
• The MEME Infrastructure team will execute updates and patching of the IMPALA platform's
operating systems.
• A Domain Name System (DNS) will be provided by the MEME network for identifying servers
and services.
• JSC Active Directory will be provided by the MEME for user profiles and authentication, as well
as group membership.
• Access to the Launchpad application will be provided for PIV card authentication.
• The MEME network will provide network connectivity and access to the IMPALA platform.
• The MEME Infrastructure team will generate a template for user requests through the NASA
Access Management System (NAMS).
• MEME will provide the peripherals needed for testing the IMPALA platform within the JSC
environment (laptops, network connection to the IMPALA platform).
4.0 OVERALL SYSTEM ARCHITECTURE
The IMPALA platform uses an open framework designed to make adding, updating, or swapping
components easy. The IMPALA platform is also designed to ensure scalability (the ability to grow to meet
the analytics needs of NASA), availability (the ability to ensure both data and systems are available to users),
and security (ensuring that both data and systems are protected against attacks or unauthorized
access).
The IMPALA platform provides the following additional benefits to the end-user:
• Ease of use
• Data Cataloging
• High Speed Search
• Collaboration
• Self-service Dashboards and reports
The IMPALA platform will reside within the MEME network, which in turn resides within the Johnson
Space Center (JSC) network. The sub-sections to follow provide an overview of the IMPALA platform
design from the following views:
• Logical
• Functional
• Infrastructure
• Network
• Security
These views will be described from a user’s perspective and provide functional descriptions of the
components within the platform. Section 7 of this document will expand on these descriptions at a
more technical level.
4.1 LOGICAL ARCHITECTURE
Figure 4.1-1 IMPALA Key Enablers and Platform Logical Architecture
As depicted in Figure 4.1-1, the IMPALA Platform enables the NASA community of users to operate
within seven (7) key logical layers: Capture, Transport, Refine, Store, Analyze, Distribute, and Manage,
using the suite of components within the platform. It also supports the governance of data and
collaboration amongst the users. This section gives a brief description of the seven (7) logical layers, the
functions performed within each layer, and the user community's interaction with each layer. Table 4.1-1
provides a brief overview of the different user role groupings, as well as how they interact with the
layers of the IMPALA platform.
TABLE 4.1-1 MAPPING OVERVIEW: USER ROLES TO IMPALA PLATFORM LAYERS
Layer (IMPALA Roles): Functions

Capture (Data Owners):
• Identify data sources to be ingested into IMPALA

Transport (Developers):
• Connect to multiple data sources
• Build ETL workflows to ingest data into the IMPALA data reservoir at scale
• Apply common rule-based transformations to data on ingest

Refine (Data Scientists, Data Owners, Data Stewards):
• Profile data for anomalies and correct them
• Catalog the data by tagging
• Address exceptions/errors identified during data ingest

Store (Administrator):
• Ensure the creation of and proper access control for data landing zones
• Ensure health of the data reservoir
• Ensure accurate logging and auditing

Analyze (Data Scientists, Data Analysts):
• Data set creation and management
• Statistical and machine-learning analysis
• Generation of dashboards and reports
• Exploration of data
• Searching, mining and ad-hoc querying

Distribute (Data Scientists, Data Analysts):
• Publish insights
• Package and publish generated data sets

Manage (Administrators, Developers, Data Stewards):
• Continuous monitoring of IMPALA infrastructure health
• Ensure that data is appropriately cataloged within the data reservoir
4.1.1 Capture
The capture layer is focused on identifying data sources that are used in servicing the functions in the
analyze layer. This layer is primarily process driven and requires domain knowledge to identify the data
sources needed. Data will initially be captured from existing data sources within the NASA
environment, which include the EMR (Electronic Medical Record), the LSAH (MEDB Lifetime
Surveillance of Astronaut Health), and MEDB SharePoint. Data sources in the capture layer are
technically outside the IMPALA platform. IMPALA interfaces with these data sources to ingest data.
The IMPALA platform is extensible and able to capture data from other types of data sources. The
platform also supports connectivity and capture/identification of relational data sources such as
Oracle, Microsoft SQL Server, and Microsoft Access, as well as data located in file shares, cloud-based
data stores and local files of varying formats (XML, CSV, JSON, PDF, and more).
The process of identifying these data sources is a collaborative effort between the HH&P IT and the
Domain subject matter experts (SMEs). The SMEs define the data of interest and the HH&P IT teams
identify the appropriate data source within the environment containing the requested data.
In the capture layer, multiple data sources are identified for transport (import) into the IMPALA
platform.
4.1.2 Transport
The transport layer of the IMPALA platform focuses on the movement of data from the production data
sources, present and future, to the IMPALA platform data store. Where the Capture layer focuses on
the identification of the data, the Transport layer contains the pipeline for bringing the data into the
IMPALA platform.
The IMPALA platform’s transport layer is not just limited to RDBMS tools previously listed (ORACLE,
Access, and MS SQL), it enables connection to any relational data source that supports ODBC or JDBC
connectors, connection to file shares and connection to data residing in the cloud.
The HHP&IT team will leverage the functions within this layer to build, test and deploy data capture
jobs. These data capture jobs will be scheduled to connect and load data from the data sources (EMR,
LSAH, etc.). As the list of production data sources grows beyond the initial data sources, this team will
support the design and development of the required data capture jobs. This team also manages the
interface to the production data source. The interfaces for relational data stores include ODBC and
JDBC as well as the credentials to connect to these production data stores. The IMPALA platform uses
built-in functions, such as secure copy or secure file transfer protocols, to import non-relational data
sources such as files. The data capture jobs are orchestrated using the Pentaho engine. Within this
engine, business rules are defined to apply static transformations to the data as it is transported into
the data reservoir.
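For illustration only, the following minimal sketch shows the general shape of a relational capture job of the kind described above: connect to a source over ODBC, pull a result set, and land it as a delimited file for the downstream refine and store layers. The DSN, credentials, query, and landing path are hypothetical placeholders; in the IMPALA Platform the production capture jobs are built, scheduled, and orchestrated in Pentaho Data Integration rather than hand-written as scripts.

    # Illustrative sketch only; not the production Pentaho job. The DSN, query,
    # and landing path are hypothetical placeholders.
    import csv
    import pyodbc  # assumes an ODBC driver and DSN are configured for the source

    SOURCE_DSN = "DSN=EXAMPLE_SOURCE;UID=svc_capture;PWD=changeme"   # hypothetical
    LANDING_FILE = "/data/landing/example/extract.csv"               # hypothetical

    def extract_to_landing(query):
        """Pull rows from the relational source and write them to a landing file."""
        rows_written = 0
        conn = pyodbc.connect(SOURCE_DSN)
        try:
            cursor = conn.cursor()
            cursor.execute(query)
            with open(LANDING_FILE, "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow([col[0] for col in cursor.description])  # header row
                for row in cursor:                                       # stream every row
                    writer.writerow(row)
                    rows_written += 1
        finally:
            conn.close()
        return rows_written

    if __name__ == "__main__":
        count = extract_to_landing("SELECT * FROM example_table")       # hypothetical table
        print(f"Landed {count} rows for the refine and store layers")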
4.1.3 Refine
In the IMPALA refine layer, data that is ingested through the transport layer is refined (selected,
standardized, tagged, categorized, summarized) and enriched either manually or automatically. Users are
able to review ingested data and apply transformations or error-cleaning logic to it. SMEs are also able
to use their knowledge to enrich the data with tags, labels and comments, as well as to create newly
derived fields. The data in this layer is transient, which means it is not in its final state. It is also
important to note that all transformations or profiling functions applied to the data in this layer are
performed against an internal IMPALA copy of the original data (from the Transport layer), not the data
source.
In this layer the users (Developers, Data Owners, Data Stewards, Data Scientists and Data Analysts)
crowd-source (i.e., collaborate on) their knowledge of the domain to create profiling jobs that enable the
cataloging of data going into the IMPALA store. These jobs can then be scheduled or called on demand
by components of the Transport layer.
The greatest benefit of the work done in this layer is the improved efficacy of searches against, and
analysis of, the data performed by users such as data scientists and data analysts. The SME
community tags each data table and field that passes through the refine layer with common terms, as
defined and refined by end users.
4.1.4 Store
The IMPALA store layer provides the landing zone for captured, transported and refined data. This
landing zone is the distributed IMPALA data reservoir. Three (3) key principles govern the design of this
layer:
Scalability
The IMPALA store layer is built using the Apache Hadoop platform. Apache Hadoop is an open
source software platform for distributed storage and processing of very large data sets on
server clusters built from commodity hardware. Because it is built on a cluster of servers,
scaling to match growth in data is as simple as adding more server hard drives or adding
another server. As the data analysis needs of NASA grow, the data size will grow. The IMPALA
platform store layer ensures that when this growth occurs the system can accommodate it.
Redundancy
The distributed storage design of IMPALA’s Apache Hadoop data store layer ensures that every
block of data is replicated 3 times across the cluster of servers. This means that even with the
loss of a server or a hard drive, the system continues to function normally. Built-in
management modules, described in section 7, ensure that when a hardware failure occurs, an
administrator is notified. Once the server or hard drive is repaired/replaced, data is copied back
to the server/drive as it resumes as part of the cluster. In addition, this redundancy allows data
processing on a given segment of the data to be performed on any of its available copies,
reducing data processing bottlenecks and improving performance on large datasets.
Performance
The Apache Hadoop platform supports distributed processing and the IMPALA platform design
leverages this feature by ensuring that all end user jobs/functions, such as searching, enriching,
cataloging and transforming data, are performed by all the nodes/servers in parallel in the
cluster.
With these three (3) key principles, the IMPALA platform provides quick response times as well as
confidence in data safety.
Another key function of the store layer is its ability to store data of all types. Relational databases or
file-based data such as PDFs, CSVs, DOCs, XMLs, JPGs, PNGs and more can be stored and coexist within
the IMPALA store layer. They can also be blended together.
With this layer, the users of the IMPALA platform gain access to one aggregated data source with a
variety of data types and they are able to process and use these datasets for analytics and search.
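As a rough illustration of the storage behavior described above, the minimal sketch below copies a landed file into a hypothetical landing-zone path in the data reservoir and reports its replication factor using the standard Apache Hadoop file-system shell. The paths are placeholders; the actual landing-zone layout and load governance are defined in the Data Governance Framework document.

    # Illustrative sketch only. Paths are hypothetical; assumes the Hadoop client
    # tools ('hdfs') are on PATH and the caller is authenticated to the cluster.
    import subprocess

    LANDING_DIR = "/data/reservoir/landing/example"    # hypothetical landing zone
    LOCAL_FILE = "/data/landing/example/extract.csv"   # file produced by a capture job

    def run(cmd):
        """Run a Hadoop shell command and return its standard output."""
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Create the landing directory (if needed) and copy the file into the reservoir.
    run(["hdfs", "dfs", "-mkdir", "-p", LANDING_DIR])
    run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, LANDING_DIR])

    # HDFS replicates each block (three copies by default); report this file's replication.
    replication = run(["hdfs", "dfs", "-stat", "%r", LANDING_DIR + "/extract.csv"]).strip()
    print("Replication factor:", replication)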
4.1.5 Analyze
The analyze layer leverages the processing power of the store layer and provides components that
enable searching and mining of data, ad-hoc querying at scale, collaborative data request fulfillment,
reports generation and data exploration. The data analysis and visualization tools provided with the
IMPALA platform work with the Apache Hadoop platform to handle the distributed processing
intrinsically, providing the end users with familiar functions for joining, searching, querying, and
analyzing their datasets. The user interfaces in this layer are all web-based and many provide drag and
drop capabilities and immediate feedback on a sample of the data.
NASA data scientists and analysts use the components of this layer to consume data, deliver insights,
manage relationships between different data sources, and create and edit data sets for fulfilling data
requests.
The users access this layer through a web-based portal after authenticating against the JSC Active
Directory domain. The IMPALA platform uses role-based access control to ensure that a user is only
allowed to access or analyze the data for which they have been granted permission.
4.1.6 Distribute
Data sets created in the analyze layer may be distributed or published to other users using the components
within this layer. Internal JSC users can authenticate into the IMPALA platform and view or download
generated data sets, analytic results, or visualizations.
4.1.7 Manage
In this layer, the Data Stewards and Developers develop scripts for data loading and transformation,
cataloging of data, and tagging data based on domain knowledge. These scripts are tested and then
packaged for use in the capture, transport and refine layers. The packaged scripts are accessible as web
services or through command line interfaces.
Administrators use the components of this layer to manage access and privacy of data.
4.2 FUNCTIONAL ARCHITECTURE
Figure 4.2-1 Layered Functional View of IMPALA platform
This section describes the functional purpose of each layer mentioned in the Logical Architecture
(section 4.1). It also briefly describes how the users will gain access to and interact with the platform.
The IMPALA platform will leverage the existing Johnson Space Center (JSC) NASA Access Management
System (NAMS) and process to provision users (i.e., Users will request access to IMPALA through
NAMS). An IMPALA review board reviews user access requests for approval. The IMPALA administrator
creates the user profiles and provisions the appropriate role within the IMPALA system after the NAMS
request is approved.
Based on the current MEME infrastructure, users with appropriate credentials interact with the IMPALA
platform by connecting to the JSC network from an onsite workstation or through the JSC VPN and
authenticating into the IMPALA platform.
The IMPALA portal is the web-based graphical user interface (GUI) used for accessing all the
components described in the Logical Architecture view. Access to this interface is through a web
browser on the user’s desktop workstation.
Authenticated and approved users are able to create information dashboards and visualizations,
connect to multiple data sources, define and generate a data catalog, search for data and wrangle (or
clean) data pre or post search. Each component behind the IMPALA portal is web-based and provides
drag and drop or point and click capabilities for ease of use.
The transport functional layer enables the hydration of data into the data reservoir from the initial data
sources identified in the capture process, such as the EMR (Electronic Medical Record) or the LSAH (Lifetime
Surveillance of Astronaut Health), which reside on two (2) different relational database management systems
(RDBMS): Oracle and Microsoft SQL Server (MS SQL). Data is processed, cleaned, and profiled during
import for the purpose of search, analytics or reporting.
Developers create Extract, Transform and Load (ETL) jobs in the transport layer using Pentaho Data
Integration tools to move/copy data. These jobs, once validated, are scheduled to ensure the data
reservoir contains fresh data. The transport layer also functions as an orchestration engine by allowing
external processes, such as transformation scripts built in the refine layer using Trifacta, to be called
in addition to the standard copying of data.
The refine functional layer provides components such as Trifacta and Pentaho to create and apply rule-
based data transformations before load. Transformations created and packaged in Trifacta for
repeatable cleansing of data are referred to as static transformations. These static transformations are
provided as steps in the transport layer mentioned above.
The analyze functional layer described in section 4.1.5 consists of the components (MASH reports,
Dashboard, Data Catalog, Search and Data Wrangling) that:
• enable the creation of repeatable workflows that pull data from the data reservoir,
• establish collaboration workspaces for different risk specializations,
• connect to and pull data from other data sources to enrich searches, and
• create dashboards for reporting and informing flight surgeons, and finally create datasets
that fulfill requests for data.
The functionality within the analyze functional layer allows for the creation of MASH reports,
Dashboards, Data Cataloging, Searching and Data Wrangling, as depicted in Figure 4.2-1.
Behind these components is the Distributed Data Reservoir (in the store layer). This component
supports the authentication and authorization within the IMPALA platform data reservoir. Each of the
services in the other layers must authenticate against the data reservoir in order to perform any
functions with the data reservoir. This authentication is controlled using Kerberos. Authorization at the
data level is accomplished via role-based access control (RBAC) using Sentry software. This layer also
enables the encryption of data at rest for privacy. Details on the authentication and encryption are
discussed in section 7 of this document.
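For illustration, Sentry role-based privileges are typically administered as SQL statements issued through a Kerberos-authenticated HiveServer2 session. The minimal sketch below, using the PyHive client, shows the general pattern only; the host, role, group, and database names are hypothetical, and the authoritative role and privilege definitions for the platform are covered in section 7.2.3 and the Data Governance Framework.

    # Illustrative sketch only; names are hypothetical and the real privilege model
    # is defined in section 7.2.3. Assumes a valid Kerberos ticket and a
    # Sentry-enabled HiveServer2 endpoint reachable from the client.
    from pyhive import hive

    conn = hive.connect(
        host="hiveserver2.example.internal",   # hypothetical master-node service
        port=10000,
        auth="KERBEROS",
        kerberos_service_name="hive",
    )
    cursor = conn.cursor()

    # Create a role, map it to an Active Directory group, and grant read access.
    for statement in (
        "CREATE ROLE analyst_role",
        "GRANT ROLE analyst_role TO GROUP analysts",
        "GRANT SELECT ON DATABASE example_db TO ROLE analyst_role",
    ):
        cursor.execute(statement)

    cursor.close()
    conn.close()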
The IMPALA NAS (Network Attached Storage) will be used to back up the data reservoir. This process is
described in section 6.2.4.
The IMPALA platform leverages existing infrastructure management functionality within the MEME
network by providing system SNMP (Simple Network Management Protocol) traps to existing health
management tools such as SolarWinds. The MEME environment currently uses an agent-based log
aggregation tool for security auditing. To ensure compliance, these agents are installed on the IMPALA
servers for security auditing purposes. The MEME environment currently uses an agent-based tool, Dell
KACE, for inventory and system management. The servers within the IMPALA platform leverage the
Center for Internet Security (CIS) benchmark to ensure appropriate hardening of the servers.
The IMPALA platform provides an issue-tracking interface, using JIRA, that enables HHPIT and IMPALA
administrators to post issues within the platform. These issues are tracked through to resolution using
the IMPALA issue-tracking component (JIRA). JIRA also enables the tracking of updates to the IMPALA
platform by entering and tracking update requests for new releases of functionality.
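As an illustration of how an issue might be raised against the platform programmatically, the sketch below posts a new issue through JIRA's REST API. The base URL, project key, and credentials are hypothetical placeholders; administrators can equally create and track issues through the JIRA web interface.

    # Illustrative sketch only; the JIRA base URL, project key, and credentials are
    # hypothetical placeholders for the IMPALA issue-tracking instance.
    import requests

    JIRA_BASE = "https://jira.example.internal"     # hypothetical JIRA URL
    AUTH = ("impala_admin", "changeme")             # hypothetical service credentials

    issue = {
        "fields": {
            "project": {"key": "IMPALA"},           # hypothetical project key
            "summary": "Data node disk alert reported by monitoring",
            "description": "Describe the symptom, the affected node, and the time window.",
            "issuetype": {"name": "Bug"},
        }
    }

    # JIRA's REST API (v2) creates an issue from a POST of the fields payload.
    response = requests.post(JIRA_BASE + "/rest/api/2/issue", json=issue, auth=AUTH)
    response.raise_for_status()
    print("Created issue:", response.json()["key"])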
These layers work together to allow users of the IMPALA platform to work collaboratively to meet the
goals defined in section 3.1.
4.3 INFRASTRUCTURE ARCHITECTURE
Figure 4.3-1 Overview of Infrastructure Architecture
The IMPALA Platform is a self-contained rack located in the building 46 Data Center in room 300. It is
interconnected electronically with the MEME environment behind the MEME firewall. The sub-
components of the IMPALA platform are deployed as a combination of virtual machines and physical
servers. The user facing components are deployed as virtual machines across three physical hosts,
called application nodes. The IMPALA distributed data reservoir is deployed across six (6) physical
servers (2 Master Nodes and 4 Data Nodes). Communication between the application nodes and the
data nodes within the rack is through services on the Master nodes over a 10Gbe network.
The IMPALA rack contains four switches: two (2) 10Gbe switches in a High Availability (HA) configuration for intra-rack
communication and two (2) 1Gbe switches in an HA configuration for
user/management communication. The user/management switch connects to the MEME firewall for
user and management access. The IMPALA rack also contains an iSCSI Storage device used as a central
store for VMs and for backup of data within the reservoir. The backup strategy is outlined in the
Operational Scenario section.
Data from data sources such as the Electronic Medical Record (EMR), the Lifetime Surveillance of
Astronaut Health (LSAH) and Medical Evaluation Document part B (MEDB), as well as future data
sources, will be ingested into the IMPALA data nodes. Layout and landing zone information as well as
the governance processes for data loads are in the Data Governance Framework document.
As stated in section 4.2, the IMPALA Platform leverages the following existing processes and tools
within the MEME environment for health management and security posture:
• Configured SNMP traps for monitoring IMPALA Platform server components such as CPU, disk,
memory, etc., as well as availability or uptime of the servers, forwarded to the infrastructure
management tool provided by the MEME infrastructure team
• Anti-virus agents, as provided by the MEME infrastructure team, deployed on IMPALA
servers
• Server hardening leveraging the Center for Internet Security (CIS) benchmarking
requirements as provided by the NASA IT security team
• Deployment of log aggregation agents as provided by the HHPIT security team for auditing
Patches to the operating system and upgrades of the individual components of the IMPALA platform
will adhere to the policies set forth by the NASA IT security team. Section 6.2.3 details the process for
patching the IMPALA platform.
The IMPALA Platform physical architecture supports high performance through the following means:
Clustering
The data nodes are clustered enabling sharing of the workload across commodity hardware.
The application server hosts are also clustered to ensure that the loss of one application server
does not lead to unavailability of any of the applications
RAID Configuration
The OS hard disks for the servers within the IMPALA Platform are in a RAID-1 configuration to
ensure continued operations in the event of a hard disk failure. The storage device is
configured for RAID-6
Distributed Storage
Each block of data is triple replicated across multiple disks and servers to ensure that the loss of
one drive or server does not lead to data loss
Distributed Processors
The IMPALA Platform leverages the processors/cores across all the servers within the cluster to
ensure that requests such as searching, data mining, or analytics are processed quickly and in a parallel
fashion
Network Throughput
A 10Gbe intra-rack backbone ensures high speed data transfer between the master nodes, data
nodes and application nodes.
The local iSCSI storage device within the IMPALA rack serves the following purposes:
VMDK (Virtual Machine Disk) Storage
Each application host (also known as an ESXi server) will host the virtual machines on which each
component (or application) runs. Each virtual machine (VM) will have its disk located on the
iSCSI storage devices. This design enables high availability within IMPALA’s virtual
infrastructure.
Backup
Section 6.2.4 describes the backup process, schedule and methodology within the IMPALA
platform. The following general items will be backed up to the storage device by default:
o Application configuration files
o Application metadata, such as project files or metadata databases generated
as a result of usage
4.4 NETWORK ARCHITECTURE
The IMPALA platform operates on four main networks:
• User Network
A 1Gbe network dedicated to end user communications. Two (2) IMPALA Top-of-Rack (TOR)
switches configured in HA support this network.
• Internal Cluster Network
A 10Gbe network dedicated to communications between the applications and services within
the IMPALA platform. Two 48-port 10Gbe switches support this communication. These
switches are in a High Availability (HA) configuration.
• Internal Virtual Machine (VM) Network
A 10Gbe network dedicated to communications between the virtual machines and the storage unit
within the IMPALA platform.
• Management Network
A 1Gbe network dedicated to management of platform devices and monitoring the health of
the IMPALA platform, and a 1Gbe network dedicated to out-of-band (OOB) management of the
servers within the IMPALA platform.
The connections to these networks and how the IMPALA platform is connected to the MEME network are
depicted in Figure 4.4-1.
Figure 4.4-1 Network Architecture Overview
Users (Data Analysts, Data Scientists, Researchers, etc.) interact with the IMPALA platform by
connecting to the JSC network from an onsite workstation or through the JSC VPN. Each server within
the rack is fitted with multiple Network Interface Cards (NICs), as depicted in figure 4.4-1.
The two 10Gbe NICs are configured in a NIC team using NIC teaming, a process of combining NICs
for improved performance. These two NICs support communications on the 10Gbe Internal Cluster
Network. The Internal Cluster Network is used for communication between the applications located on
the virtual machines (VM) and the data reservoir nodes (master and data nodes). It is also used for
communication between the different components of the IMPALA data reservoir, for example,
communication between the resource manager and the data nodes.
The storage device and the application hosts (VM hosts) are also configured with a 10Gbe NIC each. These
enable communication over the internal VM Network.
For communication on the User Network, each server is configured with two 1Gbe NICs, also configured
in a NIC team for performance and high availability.
For communication over the Management Network, each server is configured with two 1Gbe NICs. The
first is for monitoring the IMPALA platform via management tools, as well as for patching and other
administrative functions that affect the OS of the servers. The second NIC is for Out-Of-Band (OOB)
communication using the iDRAC (integrated Dell Remote Access Controller). The iDRAC provides
functionality that helps in deploying, updating, monitoring and maintaining the servers with or without
a system management software agent.
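For illustration, out-of-band health checks against an iDRAC can be scripted with the standard ipmitool utility, as in the minimal sketch below. The address and credentials are hypothetical, IPMI over LAN must be enabled on the iDRAC, and administrators may equally use the iDRAC web interface or Dell's racadm utility.

    # Illustrative sketch only. The iDRAC address and credentials are hypothetical,
    # and IPMI over LAN must be enabled on the iDRAC for these calls to succeed.
    import subprocess

    IDRAC_HOST = "192.0.2.50"      # hypothetical OOB management address
    IDRAC_USER = "oob_admin"       # hypothetical iDRAC account
    IDRAC_PASS = "changeme"

    def ipmi(*args):
        """Run an ipmitool command against the iDRAC over the OOB management network."""
        cmd = ["ipmitool", "-I", "lanplus",
               "-H", IDRAC_HOST, "-U", IDRAC_USER, "-P", IDRAC_PASS] + list(args)
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    # Power state and chassis health are reachable even when the server OS is down.
    print(ipmi("power", "status"))
    print(ipmi("chassis", "status"))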
4.5 SECURITY ARCHITECTURE
Security within the IMPALA platform is designed to protect the data via five (5) discrete layers:
Perimeter Security, Authentication, Authorization, Encryption and Policy.
Figure 4.5-1 IMPALA Security Overview
Perimeter security refers to the network controls that protect network access to the IMPALA platform.
The IMPALA platform is within the MEME environment, which is guarded by the MEME Firewall. For
any user to gain access to the MEME environment, they must first establish a secure Virtual Private
Network (VPN) connection.
Both users that access the IMPALA platform and service components that operate within the IMPALA
platform must authenticate to gain access to, or perform any operation against, the data. Users request
access to the IMPALA platform through the NAMS system. The MEME administrator fulfills these
requests based on defined policies established in the data governance guide. Users authenticate through
the NASA LaunchPad system (see Figure 4.5-2 below).
Figure 4.5-2 User Authentication process overview
Service accounts are created within the IMPALA platform for each component to communicate with the
data reservoir. These service accounts are created in the Kerberos Key Distribution Center (KDC) so each
service has its own unique credentials (as keytabs) to access the data in the data reservoir and to
execute tasks against that data. This process is described in detail in section 7.
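For illustration, the sketch below shows how a service typically authenticates non-interactively with its keytab before touching the data reservoir. The principal name and keytab path are hypothetical placeholders; the creation and distribution of the actual service principals and keytabs are described in section 7.

    # Illustrative sketch only. The principal and keytab path are hypothetical;
    # service principal creation and keytab distribution are covered in section 7.
    import subprocess

    KEYTAB = "/etc/security/keytabs/example-service.keytab"               # hypothetical
    PRINCIPAL = "example-service/host01.example.internal@EXAMPLE.REALM"   # hypothetical

    # Obtain a Kerberos ticket from the keytab without an interactive password...
    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

    # ...then list the cached ticket to confirm the service can reach Kerberized services.
    subprocess.run(["klist"], check=True)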
For authorization, the IMPALA system leverages a role-based access control system at the data and
component layers of the platform. It also leverages groups defined in Active Directory via a one-way
trust between the NDC Active Directory and the Kerberos KDC. Sentry, described in more detail in
section 7, is used to assign role-based privileges to each data entity within the IMPALA platform.
All data in the data reservoir is encrypted using AES-256 encryption. This ensures that malicious access
to the data does not lead to the release of PII.
Per the JSC/HHPIT security guidelines, the server operating systems are hardened using the Center for Internet Security (CIS) guidelines to ensure a proper lockdown of the system. Antivirus and monitoring agents are installed on all servers within the IMPALA platform. System and application logs are captured using a log aggregation tool specified by the HHPIT security team.
The table below shows a breakdown of each security function and which tool or component supports it.
TABLE 4.5-1 SECURITY FUNCTION TO TOOL MAPPING
Security Function | Provided by | Native/Leveraged
Perimeter Security | MEME Firewall/VPN | Leveraged
Access Control | Sentry | Native
Access Requests | NAMS | Leveraged
System Monitoring | SolarWinds | Leveraged
Security Audit Logging | Splunk | Leveraged
Application/System Audit Logging | Splunk | Leveraged
OS Hardening | CIS Benchmark | Native & Leveraged
Encryption of data at rest | Navigator Encrypt / AES-256 | Native
Encryption of data in motion | TLS/SSL | Leveraged
Compliance Audit Support | Navigator Audit / Navigator Lineage | Native
Key Management | KMS/KTS | Native
Malware & Harmful Code Protection | TrendMicro | Leveraged
User Identity | NDC AD | Leveraged
Application Access | Kerberos | Native
User Access | Launchpad | Leveraged
Inventory/System Management | Dell KACE | Leveraged
Vulnerability Scanning | MVM | Leveraged
Software Vulnerability & Memory Protection | EMET | Leveraged
5.0 DATA GOVERNANCE FRAMEWORK
A Data Governance Framework will be built for the NASA HHPC program to ensure clear communication within the IMPALA team and across all organizations it touches; to maintain scope and focus; to establish accountabilities; and to define measurable successes. The details of the Data Governance Framework are presented in the Data Governance Framework document.
6.0 OPERATIONAL SCENARIOS
The IMPALA Platform provides a single source of data for NASA data analysts, data scientists, and other users to locate data, perform analysis, and share their investigations. The scope of functionality provided by the tools in the IMPALA platform allows for a large number of operational scenarios. This document covers the operational scenarios most likely to occur at the initial onset of use.
Initially, the IMPALA data reservoir is hydrated by Pentaho jobs that run either on demand or on a recurring schedule (daily or weekly) to refresh the data. Unsuccessful loads generate exceptions and error logs. After data ingest, ETL engineers use Trifacta to confirm the data or to cleanse and transform it. Exception handling is also performed at this stage and logged in the system. After the data is imported and cleansed, it is ready for use by the Data Scientists and other end users. The sections that follow describe expected operational scenarios using the data imported into the IMPALA platform.
6.1 END USER USAGE SCENARIOS
This section describes initial usage scenarios that portray end user experiences with the IMPALA
platform. These usage scenarios are grouped as End User and IT User scenarios. Table 6.1-1 is a
summary of some of the usage scenarios that are enabled by the IMPALA platform, which layer of the
platform they apply to, and which general IMPALA roles operate within those layers.
TABLE 6.1-1 SAMPLE USAGE SCENARIOS MAPPED TO IMPALA ROLES AND LAYERS
User Experience | Applies to | Group | Applicable IMPALA Layer
Perform Data Ingestion at scale | Developer | IT User | Transport and Store
Profiling, Cataloging and managing data | Data Owner, Data Steward | End User | Refine
Manually edit data | Data Steward, Data Owner | End User | Analyze
Create, manage & share data sets | Data Scientist, Data Owner | End User | Analyze and Distribute
6.1.1 Data Ingestion
The IMPALA platform leverages Pentaho Data Integrator (PDI) for ingesting data into the IMPALA data reservoir. Pentaho Data Integrator enables developers on the HH&P IT team to create custom workflow-based jobs that pull data from the source, then transform and push the data into the data reservoir. These jobs are executed on demand, or on a schedule, to hydrate the data reservoir from data sources such as the EMR, LSAH, MEDB, and file shares hosting generated reports. The data ingest process includes identifying data sources to ingest, identifying static transformations that need to be applied to the data upon ingest, tracking requests to create repeatable data ingest pipelines, and creating and executing those pipelines.
Figure 6.1.1-1 Data Ingestion Flow
The Data Governance board identifies data sources that should be ingested into the IMPALA data reservoir. Data Scientists, Data Owners, and Data Stewards define:
- Filters to ensure that data pulled into the data reservoir is relevant.
- Static transformations or business rules, such as unit changes or handling of nulls, that need to be applied to the data upon ingest.
- Refresh intervals and other ingest parameters used by the HH&P IT team to create data ingest pipelines.
The identified transformations, data sources, and ingest parameters are turned into requirements that are tracked within the JIRA component. The HH&P IT team reviews the requirements within JIRA and creates:
- Sqoop jobs for reading data from a relational database and writing the data into IMPALA.
- Modular scripts used to transform the data using Trifacta and custom scripts.
- Workflows that orchestrate the connection to the data sources, the querying and filtering of the data, and the transformation of the data as it is pushed into the data reservoir.
In addition to creating workflows that are based on business rules (static transformations), transformations created after the initial data load using Trifacta can also be packaged and applied to future ingests. The analyst reviews the data in the data reservoir and identifies the transformations that need to be persisted. These transformations are then created using Trifacta, and the resulting script is added to the data ingest workflow pipeline for subsequent data ingests.
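For illustration only, the sketch below shows the kind of Sqoop import that a Pentaho workflow step might wrap to pull a relational table into the data reservoir, as described above. The connection string, credentials, table name, and target directory are hypothetical placeholders, not actual IMPALA configuration values.

    import subprocess

    # Hypothetical example of an ingest job; all connection details below are
    # placeholders, not real IMPALA settings.
    sqoop_cmd = [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@//example-db-host:1521/LSAH",  # placeholder source
        "--username", "svc_ingest",                                    # placeholder service account
        "--password-file", "/user/svc_ingest/.password",               # keeps credentials off the command line
        "--table", "LAB_RESULTS",                                      # placeholder table
        "--target-dir", "/data/landing_zone/lsah/lab_results",         # landing zone in HDFS
        "--num-mappers", "4",
    ]

    # In practice the Pentaho orchestration workflow executes this step on a
    # schedule; here we simply run it and surface any failure so the job's
    # exception/error logging step can record it.
    result = subprocess.run(sqoop_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print("Sqoop import failed:")
        print(result.stderr)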
6.1.2 Profiling and Cataloging Data
As part of a Data Governance framework, profiling and cataloging are organization techniques that
ensure data can be located and tracked within the data reservoir. One of the benefits of a data
reservoir include providing a single source for data of all types and origins. However, due to the size
and scope of the data reservoir capabilities, if organization techniques are not employed as the
reservoir fills, locating the desired information for analysis or visualizations can become increasingly
difficult. The Data Governance framework provides these processes to keep the data organized.
Figure 6.1.2-1 Data Cataloging Flow
The initial step for organization is determined by the folder structure used for storing data. The
incoming data is held in a landing zone within the IMPALA data reservoir while it is being profiled and
cataloged, then it is moved to a pre-established destination. Datasets and extracts created from the
original files can be tracked through the IMPALA platform using lineage and pedigree tools.
Profiling establishes metadata for incoming data on both the file and data element levels. The IMPALA
platform provides tools that perform automated profiling and tagging when the data is in the landing
zone within the IMPALA data reservoir. In addition to the tags provided by the tools, the user can
define business rules to customize tags based on the content of the incoming data. Once the cataloging and tagging jobs have been defined, a developer builds a Pentaho Data Integrator workflow to automate the cataloging and profiling process whenever new data is placed into the landing zone.
Cataloging the profiled data allows users to search the metadata and apply custom tags to quickly
locate files with specific content.
The Data Governance board provides the processes that will be followed for both automated and customized profiling, cataloging, and file hierarchies. Collaboratively, Data Scientists, Data Owners, Data Stewards, and Analysts define:
- Filters to ensure that data pulled into the data reservoir is relevant
- Domains and tags for that data
- Filters and tags to define subject-area datasets
- Business rules for custom tagging
- Metadata searches
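As a purely illustrative sketch of the custom tagging business rules described above (the actual rules are defined by the tagging tools and the data governance process, and the patterns and tags below are hypothetical), such a rule can be thought of as a mapping from content patterns to tags:

    import re

    # Hypothetical business rules: map a column-name pattern to the tag that
    # should be applied when the pattern is seen in newly landed data.
    TAG_RULES = [
        (re.compile(r"(ssn|social_security)", re.I), "PII"),
        (re.compile(r"(dob|birth_date)", re.I), "PII"),
        (re.compile(r"(hr|heart_rate|bp|blood_pressure)", re.I), "physiology"),
    ]

    def tags_for_columns(column_names):
        """Return the set of tags implied by a dataset's column names."""
        tags = set()
        for name in column_names:
            for pattern, tag in TAG_RULES:
                if pattern.search(name):
                    tags.add(tag)
        return tags

    # Example: profile a newly landed extract while it sits in the landing zone.
    print(tags_for_columns(["crew_id", "heart_rate", "birth_date"]))  # e.g. {'PII', 'physiology'}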
6.1.3 Manually Edit Data
There are many operational reasons for approved users to edit data in the data reservoir. The tools and methods used to edit data depend on how long the edit should persist, if at all. There are generally three types of edits to data within the IMPALA platform: Persistent/Static Transformation edits, Non-persistent/Analytic edits, and In-place edits.
Persistent/Static transformation edits are applied to data before it lands in the data reservoir, as well as to errors identified after the initial load that need to be addressed on every subsequent data ingest. The edits that occur as data is being ingested for the first time are applied using business rules built into Pentaho workflows. The process for these edits is described in the steps below:
1. A Subject Matter Expert/Analyst identifies a change that needs to be made to the data before ingestion, based on historical knowledge or experience.
2. A request is posted to the Developer to add business rules that edit the data in transit into the data reservoir.
3. The created business rules are inserted as steps in the data ingest workflow, ensuring that the edits are consistent and made on every data ingest.
In other scenarios of persistent/static transformation edits, a subject matter expert or analyst identifies an error in the data while working with it in the reservoir. In this scenario, the analyst uses Trifacta to make the edits to the data; this process in turn generates a script that can be called by the data ingest orchestration workflow, ensuring the transformation or edit is applied on subsequent ingests. This process is depicted in Figure 6.1.3-1 below.
Figure 6.1.3-1 Persistent Manual Edit of Generated Data Set
Data that is generated and stored within the IMPALA data reservoir, such as results of analyses or search requests, is pulled into the wrangling tool, Trifacta. The analyst, who understands the data and the edits that need to be made, applies the transformation, executes the script, and validates the resulting dataset. Execution of the script creates a new dataset in the data reservoir with all the metadata associated with the edit/transformation. A request is then sent to the ETL developer to add this script to the data ingest process for persistent edits of data with the same structure.
Persistent edits or static transformations may include the following:
- Decomposing a compound field into multiple fields
- Unit conversions to standardize the measurement system for data of the same type (such as dates, height, weight)
- Removing incorrect values
- Adding additional information
- Adding lookup or descriptive information
Tools used for persistent edits:
- Trifacta scripts integrated into the Pentaho workflow
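The sketch below illustrates, in generic terms, the kinds of static transformations listed above (a unit conversion, decomposing a compound field, and removing incorrect values). It uses pandas purely for illustration; in the IMPALA platform these edits would be implemented as Trifacta scripts called from the Pentaho ingest workflow, and the field names are hypothetical.

    import pandas as pd

    def apply_static_transformations(df: pd.DataFrame) -> pd.DataFrame:
        """Hypothetical static transformations applied on every ingest."""
        out = df.copy()

        # Unit conversion: standardize weight to kilograms.
        out["weight_kg"] = out["weight_lbs"] * 0.453592

        # Decompose a compound field into multiple fields.
        out[["systolic", "diastolic"]] = out["blood_pressure"].str.split("/", expand=True).astype(int)

        # Remove incorrect values (e.g., physically impossible readings).
        out = out[out["weight_kg"] > 0]

        return out

    sample = pd.DataFrame({
        "weight_lbs": [150, 0, 180],
        "blood_pressure": ["120/80", "118/76", "130/85"],
    })
    print(apply_static_transformations(sample))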
Non-persistent edits are generally edits made in the course of analyzing or otherwise working with the data in the data reservoir. They are performed during analysis and modeling for a particular purpose. These edits may be documented and shared with other users, but they apply only to the data as it is used in the analysis. In this type of edit, the edited data is not persisted and is used only for the duration of the analysis.
Types of edits performed during analysis may include:
- Removing null values
- Filtering outliers
- Deriving or calculating new fields
Tools used for edits during analysis:
- Trifacta scripts
- Alpine cleansing, filtering, or SQL functions
Regardless of which of the methods described above is used, the source data within the IMPALA platform is never altered during editing. Copies of both the original data and the edited data are saved, which allows all data changes to be tracked and audited.
The last type of edit is the in-place or row/cell-level edit. These edits do not need to be persisted and do not apply to the data at large. In this scenario, the analyst may notice something as simple as a misspelled word or an incorrect unit in a particular row and cell within the generated dataset. In this case, an edit outside the normal ingest flow is required. First, the user downloads the dataset in tabular form to their desktop. Then, using a tool such as Excel, the user corrects the errant value in the affected row/cell. Once the error is corrected, the data can be uploaded back into the data reservoir to a location defined by the governance process, if there is a requirement to track the change. The metadata that must accompany this process is also defined in the data governance document and is used for auditing. Figure 6.1.3-2 below depicts this process.
Figure 6.1.3-2 In-Place Row/Cell Manual Edits of Generated Dataset
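A minimal sketch of the in-place edit flow depicted above (download, correct a single cell locally, re-upload with audit metadata), assuming a CSV extract; the file names, the affected row, and the audit fields are hypothetical, and the actual upload location and required metadata are defined by the governance process.

    import csv
    import getpass
    import json
    from datetime import datetime, timezone

    SOURCE_FILE = "dataset_extract.csv"            # downloaded from the data reservoir (placeholder)
    CORRECTED_FILE = "dataset_extract_fixed.csv"   # corrected copy to re-upload (placeholder)

    # Load the tabular extract.
    with open(SOURCE_FILE, newline="") as f:
        rows = list(csv.DictReader(f))

    # Correct a single errant cell (hypothetical: a wrong unit on row 42).
    rows[41]["unit"] = "mmHg"

    # Write the corrected copy; the original file is left untouched for auditing.
    with open(CORRECTED_FILE, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

    # Record audit metadata to accompany the re-uploaded file.
    audit = {
        "edited_by": getpass.getuser(),
        "edited_at": datetime.now(timezone.utc).isoformat(),
        "source_file": SOURCE_FILE,
        "change": "row 42: corrected unit to mmHg",
    }
    with open(CORRECTED_FILE + ".audit.json", "w") as f:
        json.dump(audit, f, indent=2)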
6.1.4 Create, Manage and Share Data sets
The main operational purpose of the IMPALA platform is to provide an environment that allows users to
access data from multiple data sources for data querying, creating data sets, assembling reports to
support requirements, performing analyses, and constructing dashboards and visualizations.
A user can create a new dataset by extracting, querying, editing, filtering, merging or joining existing
datasets within the IMPALA data reservoir. The resulting dataset will be stored in the IMPALA data
reservoir, where additional tags and metadata can be added. The IMPALA platform traces the lineage
of data elements and datasets, allowing the originating source(s) of the dataset to be traced using
Waterline or Cloudera Navigator, along with the modifications that have been made to the dataset.
The IMPALA system provides data collaboration, which allows users to share datasets, investigations,
and analysis within a secure environment. Frequently used queries can be shared with other team
members. In progress or completed investigations can be shared for review, comments, or posterity. In
particular, the methods and queries used to extract and cleanse a data set can be stored and shared
with other users, along with comments about how and why certain adjustments were made to the
dataset. Authorized team members will be able to re-use search queries, cleansing scripts or workflows
to fulfill similar data needs in the future. Over time, the shared analysis methods, SQL queries, and
search criteria will become a wealth of knowledge that can be used to rapidly onboard new team
members.
6.2 ADMINISTRATOR USAGE SCENARIOS
The Administrator manages the users and the applications that comprise the IMPALA platform. The
Administrator does not work directly with the data, and has a separate set of usage scenarios.
6.2.1 User Provisioning
The IMPALA platform relies on guidance from the data governance document and the MEME
infrastructure team to generate a NAMS request template for the IMPALA platform. The NAMS request
template will define, among other attributes, a list of IMPALA sponsors, a list of IMPALA registration
authorities, and a list of IMPALA roles.
Figure 6.2.1-1 IMPALA User Provisioning Process
A user requesting access to the IMPALA platform will complete a “Request New Application Account”
form within the NAMS web UI. As part of the form, the user selects the IMPALA roles to which they
wish to belong, as well as their sponsor. The request is forwarded via email to the IMPALA sponsor who
reviews the request. An approved request will generate an email request to the IMPALA registration
authority; this is the individual or individuals with super user rights within the IMPALA platform to
provision new users. The IMPALA registration authority creates an account or user profile within each
application based on the IMPALA Role to component mapping defined in the data governance
framework. Once the user’s profile has been created within the IMPALA platform, the IMPALA
registration authority will send an email to the requesting user with instructions for accessing their
assigned components.
Using the NAMS system for account requests ensures a central location for tracking all accounts and
proper auditing.
6.2.2 System Monitoring
6.2.2.1 Log Aggregation
The IMPALA platform leverages the log aggregation solution in place within the MEME network. The
HHPIT security team will provide licensed log aggregator agents that will be deployed on all the IMPALA
platform application servers. The IMPALA data reservoir will be configured to forward its logs to the
defined log aggregator on a scheduled basis.
Figure 6.2.2.1-1 IMPALA Log aggregation process
Figure 6.2.2.1-1 is a high-level depiction of the various components of the IMPALA platform and a
sampling of the logs they forward to the log aggregation server. Log aggregation will be configured on
each application host server as well as within applications such as Cloudera, which supports the
forwarding of logs. The logs will be used primarily for auditing.
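For application components that do not forward logs natively, a common pattern is to ship application logs to the central aggregator over syslog. The sketch below shows this with Python's standard syslog handler; the aggregator hostname and port are placeholders, and the actual agent and destination are those specified by the HHPIT security team.

    import logging
    import logging.handlers

    # Placeholder aggregator endpoint; the real destination is provided by the
    # HHPIT security team's log aggregation solution.
    LOG_AGGREGATOR = ("log-aggregator.example.internal", 514)

    logger = logging.getLogger("impala.app")
    logger.setLevel(logging.INFO)

    handler = logging.handlers.SysLogHandler(address=LOG_AGGREGATOR)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    logger.info("ingest job completed")  # example audit-relevant event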
6.2.2.2 Health and Availability Monitoring
SNMP traps on the IMPALA platform will be configured to forward metrics about the health of each server and application within the IMPALA rack to SolarWinds, the network and application monitoring tool provided by the MEME infrastructure team. Health information reported about the IMPALA platform will include performance data on CPU, memory, disk, and network bandwidth utilization. These traps will utilize the SolarWinds agents for communication with SolarWinds. Availability status of the IMPALA platform will also be supported by reporting uptime information from each server through SNMP to SolarWinds. For reachability, the servers within the IMPALA platform will be configured not to block ICMP traffic, relying instead on the MEME network controls.
The IMPALA data reservoir, built on the Cloudera Enterprise Data Hub (EDH) Hadoop technology,
provides visibility into the health of the data reservoir cluster through the Cloudera Manager
component. This component will also be configured to send SNMP traps to a trap collector or
monitoring tool such as SolarWinds. Cloudera Manager is also used by the Administrator to determine
the status and health of analytic jobs running within the IMPALA data reservoir. Information on how to
configure the data reservoir to report health information to a third-party monitoring tool is provided in the Cloudera Administrator guide.
The above configuration will ensure that the IMPALA platform’s health is monitored and issue
resolution is proactive instead of reactive.
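In production, these metrics are gathered by the SolarWinds agents and SNMP. The sketch below only illustrates, using the third-party psutil library (an assumption for illustration, not part of the baseline), the kind of CPU, memory, disk, and network measurements reported for each server.

    import psutil  # third-party library, used here only for illustration

    def collect_health_metrics() -> dict:
        """Gather the performance data described above for one server."""
        net = psutil.net_io_counters()
        return {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "memory_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
            "bytes_sent": net.bytes_sent,
            "bytes_recv": net.bytes_recv,
            "boot_time": psutil.boot_time(),  # epoch seconds of last boot, for uptime reporting
        }

    if __name__ == "__main__":
        print(collect_health_metrics())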
6.2.3 System Patching and Updates
The IMPALA platform has many components and services that enable its operation. These components and services rely on specific OS configurations that affect their functionality. The installation and administration guide provided with the IMPALA platform details the configuration parameters for each component/service. It is important to note that system OS patching and updating is process-oriented and highly dependent on the release cadence of the COTS vendors.
System OS patches will be applied by the MEME network team and should be planned during maintenance windows. To patch the Windows-based systems within the IMPALA platform, the MEME network team will use Desktop Central. Patching of the Linux-based systems is a manual process that involves downloading the Red Hat Package Manager (RPM) updates to each Linux server and then deploying the patch using yum. This process is manual to ensure that packages do not interfere with the operation of the applications and because the IMPALA platform has no internet access. Each system should be backed up prior to applying a patch so that it can be rolled back if the patch causes a failure in the applications hosted on the system. For example, a Linux OS patch may update the kernel, which could break an application if the new kernel introduces incompatibilities. System (OS) patches will be scheduled separately from application updates to make it easier to determine the source of any faults or failures. Application update instructions and release notes are provided by the application partners. Application configuration files should be backed up before applying updates.
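A minimal sketch of the manual Linux patching steps described above (configuration backup followed by a local RPM install via yum), assuming hypothetical directory and package names; the actual procedure, maintenance window, and rollback plan are owned by the MEME network team.

    import subprocess
    from datetime import date

    CONFIG_DIRS = ["/etc/cloudera-scm-server", "/etc/hadoop"]   # hypothetical example config locations
    RPM_DIR = "/var/stage/patches"                              # RPMs are staged locally; the platform has no internet access
    BACKUP_TAR = f"/var/backups/config-{date.today().isoformat()}.tar.gz"

    # 1. Back up application configuration files before patching.
    subprocess.run(["tar", "-czf", BACKUP_TAR] + CONFIG_DIRS, check=True)

    # 2. Apply a staged RPM update with yum (placeholder package name).
    subprocess.run(["yum", "-y", "localinstall", f"{RPM_DIR}/kernel-update.rpm"], check=True)

    # 3. A reboot and application health checks would follow during the maintenance window.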
6.2.4 Backup & Recovery
6.2.4.1 Data Backup Schedule
The IMPALA platform will leverage the NetBackup software used by the MEME Infrastructure team to enable a weekly full backup and daily incremental backup schedule. Incremental backups will be retained for at least two weeks, and full backups will be retained for two weeks.
The storage unit has a 60TB capacity; 20TB will be used to support the virtual infrastructure and ensure high availability for the virtual machines, and the remaining 40TB will be devoted to the backup process.
Figure 6.2.4.1-1 depicts the backup schedule as well as the feasible retention and growth rate. The first line of colored blocks indicates the type of backup, numbered incrementally: blue blocks indicate full backups and green blocks incremental backups. The second line shows the purging cycle for the indicated backup (after two weeks): red blocks indicate deletion of the corresponding full backup and brown blocks deletion of incremental backups. The third line shows the size of each backup, again blue for full and green for incremental. The fourth line (yellow) shows the cumulative size of the daily backups on storage. The first full backup is 3TB, the next six incremental backups are 0.1TB each, and the eighth value is the cumulative size of the previous seven plus the size of the new full backup (0.6TB larger than the previous full). At the end of the second week, the same pattern continues, and the full and incremental backups from two weeks earlier begin to be deleted. This results in a cumulative size of approximately 30TB after 90 days. At the end of this period, a decision must be made on how much data to retain in order to extend the life of the storage unit, or whether to leverage external storage devices such as tape.
Summary of the backup schedule shown in Figure 6.2.4.1-1:
- Take daily incremental backups from Monday through Saturday
- Take weekly full backups on Sunday
- Keep daily incremental backups for 2 weeks
- Keep weekly full backups for 2 weeks
- At 3TB per full backup and 0.1TB per incremental backup, the capacity needed at 90 days is approximately 30TB
Figure 6.2.4.1-1 IMPALA Backup Schedule Short Term
6.2.4.2 Virtual Machine (VM) Backup
VMs will be backed up on a regular schedule according to HH&P IT recommended best practices for the
existing virtual infrastructure.
7.0 DETAILED DESIGN
7.1 INFRASTRUCTURE DETAILED DESIGN
The IMPALA platform is designed as a single-rack appliance with redundant power, network, and data storage. Figure 7.1-1 is a diagram of the IMPALA Platform’s hardware inventory and power requirements. Tables 7.1-1 through 7.1-3 provide details of the individual systems’ components. The IMPALA rack consists of three (3) application hosts, an Enterprise Data Hub (2 master nodes and 4 data nodes), four switches, and an iSCSI storage node. The following subsections detail the contents of the IMPALA rack and how they connect to the MEME network.
Figure 7.1-1 IMPALA Platform and Hardware Power Details
7.1.1 Application Nodes
The application nodes are running a virtual infrastructure. The virtual infrastructure includes physical
devices and software. The physical devices are three (3) servers as virtual hosts and a shared Network
Attached Storage (NAS) device. The virtualization software is VMWare vSphere Enterprise Plus. The
table below describes the components of the virtual infrastructure software and their functions.
TABLE 7.1.1-1 IMPALA VIRTUAL INFRASTRUCTURE SOFTWARE AND FUNCTIONS
Component | Function
vCenter Server | A centralized management application for the VMWare virtualization platform
ESXi Host | An operating system-independent hypervisor based on the VMKernel operating system, interfacing with agents that run atop it. It is the exclusive hypervisor for VMWare.
The IMPALA platform leverages virtualization for deploying virtual servers that host the applications.
Each application node is configured as an ESXi host, an operating system-independent hypervisor. The
IMPALA application nodes provide the compute power for the components within the IMPALA
platform. These components are the COTS (Commercial-Off-The-Shelf) products that have been
integrated to meet the requirements of big data analytics. These components and their functions are
defined in the table below.
TABLE 7.1.1-2 IMPALA APPLICATIONS AND FUNCTIONS
Component | Functions | User Role
Trifacta | This visual data ingest and transformation tool provides an intuitive interface that enables users to quickly cleanse, transform, prepare and profile data without the need for custom programming. | Internal User
Pentaho Data Integrator | Allows HH&P IT users to develop Extract-Transform-Load (ETL) jobs to copy data from data sources such as the EMR or the LSAH into the IMPALA data reservoir. | IT User
Alpine Chorus | Allows users to create local workspaces and workflows, select and combine data sources, perform analytics, and share their work. Alpine Chorus visual analytics natively supports collaboration, model version control and importing data from various data sources, including Hadoop and structured databases at scale. The analytics engine features a drag-and-drop interface to connect data to computational operators and comes with a rich set of analytic tools including classification, regression, decision trees, time series analysis, and more. | Internal User
Waterline Data | Automatically discovers metadata, categorizes data sets and creates a data inventory. Tracks the lineage of all data from ingest through analysis. Data profiling improves search capability through data cataloging and tagging that captures source, inputs, context, and parameters. | IT User
Tableau Desktop | Provides intuitive, visual-based data discovery capabilities. Also provides advanced business intelligence capabilities, dashboards, and reports. | Internal User
Centrifuge | A big data discovery technology that provides the power and flexibility to connect, visualize and collaborate. It combines sophisticated link analysis, interactive visualizations and discovery features to dramatically simplify data pattern and connection recognition. | Internal User
Cloudera Data Navigator | A complete data governance manager for Apache Hadoop, which provides data life cycle management, backup, and the capability to encrypt data at rest. | IT User
Kerberos | Kerberos is the foundation for securing the IMPALA Platform Hadoop cluster and is used to authenticate user access to Cloudera Hadoop resources. | IT User
IMPALA Portal/Identity Management | Single landing page that integrates the user interface to all IMPALA Platform dashboards, reports, and user-facing tools. Hosted by HHP IT and fully integrated with NAMS/Active Directory for user authentication and authorization. | Internal User
As depicted in Figure 4.4-1 in section 4.4, each application node is fitted with one iDRAC port and five NICs: two 1Gbe NICs dedicated to the user network, two 10Gbe NICs dedicated to intra-rack (internal cluster) communications, and one 1Gbe NIC dedicated to management. For the user network, the two 1Gbe NICs connect to the two 1Gbe switches and are configured as a NIC team for redundancy. End users will communicate with the applications over this user network after passing through the MEME VPN and firewall. The application nodes communicate with the other nodes within the IMPALA rack through the two (2) 10Gbe ports, which are also configured as a NIC team and connected to the two 10Gbe switches.
The applications within the application nodes are the primary means for end users and administrators
to interact with the IMPALA Enterprise Data Hub, also known as the IMPALA Data Reservoir (see 7.2
below). The specifications of the IMPALA platform’s application nodes are listed in the table below:
TABLE 7.1.1-3 IMPALA APPLICATION NODE CONFIGURATION
Component | Specifications
PowerEdge R730xd | PowerEdge R730xd Server
Chassis Configuration | Chassis with up to 12, 3.5" Hard Drives and 2, 2.5" Flex Bay Hard Drives
Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Additional Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Memory Capacity | (18) 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width
RAID Configuration | RAID 1 + RAID 5 for H330/H730/H730P (2 + 3-22 HDDs or SSDs)
RAID Controller | PERC H730 RAID Controller, 1GB NV Cache
Hard Drives | (2) 400GB Solid State Drive SAS Mix Use MLC 12Gbps 2.5in Flex Bay Drive
Hard Drives | (5) 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive
Additional Network Cards | Intel X520 DP 10Gb DA/SFP+ Server Adapter
Embedded Systems Management | iDRAC8 Enterprise, integrated Dell Remote Access Controller, Enterprise
Power Supply | Dual, Hot-plug, Redundant Power Supply (1+1), 750W
Power Cords | (2) NEMA 5-15P to C13 Wall Plug, 125 Volt, 15 AMP, 10 Feet (3m), Power Cord, North America
Network Daughter Card | Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet, Network Daughter Card
Hardware Support Services | 3Yr Basic HW Warranty Repair, 5x10 NBD Onsite
7.1.2 Enterprise Data Hub
An Enterprise Data Hub (EDH) is a big data management model that uses a Hadoop platform as the
central data repository. The goal of an enterprise data hub is to provide an organization with a
centralized, unified data reservoir that can quickly provide diverse business users with the information
they need to do their jobs; to quickly gain value from that data through a collection of frameworks that
span data processing, interactive analytics, and real-time serving applications. The IMPALA Enterprise
Data Hub (EDH) leverages Cloudera Hadoop as the data reservoir for the IMPALA platform. As stated in
section 4, the data reservoir is configured to be highly redundant. The enterprise data hub also
supports all the functions of executing jobs in parallel within the IMPALA platform. Users access the
functionality of the data reservoir through the applications within the application nodes. Physically, the
IMPALA EDH consists of two (2) master nodes and four (4) data nodes. These nodes are deployed on
physical servers. The EDH nodes are deployed in a cluster that leverages the resources (CPU and RAM)
of all nodes.
Figure 7.1.2-1 Enterprise Data Hub Service Layout
7.1.2.1 Master Nodes
The master nodes are deployed in an HA (High Availability) configuration with one node serving the role
of the Active Name node and the other as the Standby Name Node. Name nodes assist in tracking
resources being used by the data nodes as well as managing the execution of jobs in parallel within the
IMPALA EDH or Data Reservoir.
The master nodes communicate with the rest of the nodes within the IMPALA rack via the 10Gbe intra-rack network described in section 4. They also serve as the gateway for the applications to communicate with the data nodes. Finally, the master nodes host the services that enable administrators to interact directly with the data reservoir. End users have no direct access to the data reservoir; instead, they access functionality and data through the application nodes and the applications that reside there. Administrators will have SSH (secure shell) access to the data reservoir to perform administrative tasks over the management network. They will also have GUI-based tools dedicated to performing administrative tasks.
The table below describes the hardware specifications of the master nodes:
TABLE 7.1.2.1-1 IMPALA MASTER NODE CONFIGURATION
Component | Specifications
PowerEdge R730xd | PowerEdge R730xd Server
Chassis Configuration | Chassis with up to 12, 3.5" Hard Drives and 2, 2.5" Flex Bay Hard Drives
Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Additional Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Memory DIMM Type and Speed | 2400MT/s RDIMMs
Memory Capacity | 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width
RAID Configuration | RAID 1 for H330/H730/H730P (2 + 3-22 HDDs or SSDs)
RAID Controller | PERC H730 RAID Controller, 1GB NV Cache
Hard Drives | 400GB Solid State Drive SAS Mix Use MLC 12Gbps 2.5in Flex Bay Drive
Hard Drives | 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive
Additional Network Cards | Intel X520 DP 10Gb DA/SFP+ Server Adapter
Embedded Systems Management | iDRAC8 Enterprise, integrated Dell Remote Access Controller, Enterprise
Power Management BIOS Settings | Performance BIOS Setting
Power Supply | Dual, Hot-plug, Redundant Power Supply (1+1), 750W
System Documentation | Electronic System Documentation and OpenManage DVD Kit
PCIe Riser | Risers with up to 1, x8 PCIe Slots + 2, x16 PCIe Slots
Network Daughter Card | Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet, Network Daughter Card
Hardware Support Services | 3Yr Basic HW Warranty Repair, 5x10 NBD Onsite
7.1.2.2 Data Nodes
The IMPALA Platform data nodes are the other part of the EDH or data reservoir. The IMPALA platform includes four (4) data nodes and can scale by adding additional data nodes. Each data node performs the functions of storing data and executing requests against the data in parallel. Every block of data within the data nodes is replicated three times for redundancy to ensure data is not lost. End users have no direct access to the data nodes; their access to the data is through the application nodes. The table below depicts the specification of a data node.
TABLE 7.1.2.2-1 IMPALA DATA NODE CONFIGURATION
Component | Specifications
PowerEdge R730xd | PowerEdge R730xd Server
Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Additional Processor | Intel® Xeon® E5-2630 v4 2.2GHz, 25M Cache, 8.0 GT/s QPI, Turbo, HT, 10C/20T (85W), Max Mem 2133MHz
Memory Configuration Type | Performance Optimized
Memory Capacity | 32GB RDIMM, 2400MT/s, Dual Rank, x4 Data Width
RAID Configuration | RAID 1 + JBOD for H330/H730/H730P (2 + 3-22 HDDs or SSDs)
RAID Controller | PERC H730 RAID Controller, 1GB NV Cache
Hard Drives | 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive
Additional Network Cards | Intel X520 DP 10Gb DA/SFP+ Server Adapter
Embedded Systems Management | iDRAC8 Enterprise, integrated Dell Remote Access Controller, Enterprise
Power Supply | Dual, Hot-plug, Redundant Power Supply (1+1), 750W
PCIe Riser | Risers with up to 1, x8 PCIe Slots + 2, x16 PCIe Slots
Network Daughter Card | Intel X520 DP 10Gb DA/SFP+, + I350 DP 1Gb Ethernet, Network Daughter Card
Hardware Support Services | 3Yr Basic HW Warranty Repair, 5x10 NBD Onsite
7.1.3 Storage Node
The IMPALA Platform has an iSCSI NAS device that serves two functions: hosting configuration files and disks for the virtual machines, and providing temporary backup storage for data within the data reservoir. The storage node is only accessible from within the IMPALA rack. The table below depicts the specification of the iSCSI storage device.
TABLE 7.1.3-1 IMPALA STORAGE NODE CONFIGURATION
Component | Specifications
EqualLogic PS4210E | EqualLogic PS4210E, Intel® Xeon® E5-2630 v3, 2.4GHz, 32GB Mem
RAID Configuration | RAID 6 for H330/H730/H730P (4-12 HDDs)
Hard Drives | (12) 6TB 7.2K SAS, 12Gb, 3.5in
Power Supply | Dual, Hot-plug, Redundant Power Supply (1+1), 750W
Network Daughter Card | Dual Controllers, 10Gb, High Availability with Failover
Hardware Support Services | 3Yr Basic Hardware Warranty Repair; ProSupport: 7x24 HW/SW Tech Support and Assistance, 3 Year
7.1.4 System Software Specifications
This section provides a detailed description of each system software service. Before delving into the specific tools, we briefly discuss the operating systems for the IMPALA platform. On the client side, the IMPALA platform will support clients running most operating systems, including Windows, Linux, and macOS. The table below depicts the server-side operating systems and their counts.
TABLE 7.1.4-1 SERVER-SIDE OPERATING SYSTEMS AND COUNT
Component | Operating System | Number of Physical Servers | Number of Virtual Servers
Server Side | RedHat Enterprise Linux Version 6.7 | 6 | 16
7.1.5 Power consumption
The following two tables detail the power consumption for each component of the IMPALA platform.
TABLE 7.1.5-1 POWER CONSUMPTION
System Name | Operating System | System Function | Total Memory | Disk Configuration | RAID | Total Disk | Input Power (Watts) | Input Power (BTU/h) | Power Supply Capacity (Watts) | Power Supply Capacity (BTU/h)
FS1048D-01 | N/A | 10Gb Switch | N/A | N/A | N/A | N/A | 254.1 W | 866.9 btu/h | 460 W | 1569.6 btu/h
FS1048D-02 | N/A | 10Gb Switch | N/A | N/A | N/A | N/A | 254.1 W | 866.9 btu/h | 460 W | 1569.6 btu/h
FS448D-01 | N/A | 1Gb Switch | N/A | N/A | N/A | N/A | 65 W | 221.8 btu/h | N/A | N/A
FS448D-02 | N/A | 1Gb Switch | N/A | N/A | N/A | N/A | 65 W | 221.8 btu/h | N/A | N/A
Master Node 1 | RedHat 6.7 | Active Name Node | 256GB | 5 x 4TB | 6 | 12TB | 368 W | 1255.7 btu/h | 750 W | 2559.1 btu/h
Master Node 2 | RedHat 6.7 | Standby Name Node | 256GB | 5 x 4TB | 6 | 12TB | 368 W | 1255.7 btu/h | 750 W | 2559.1 btu/h
RedHat Node 1 | RedHat 6.7 | App Host | 512GB | 5 x 4TB | 6 | 12TB | 392 W | 1337.6 btu/h | 750 W | 2559.1 btu/h
RedHat Node 2 | RedHat 6.7 | App Host | 512GB | 5 x 4TB | 6 | 12TB | 392 W | 1337.6 btu/h | 750 W | 2559.1 btu/h
RedHat Node 3 | RedHat 6.7 | App Host | 512GB | 5 x 4TB | 6 | 12TB | 392 W | 1337.6 btu/h | 750 W | 2559.1 btu/h
Datanode04 | RedHat 6.7 | Hadoop Data | 256GB | 18 x 2TB | N/A | 36TB | 431 W | 1470.6 btu/h | 750 W | 2559.1 btu/h
Datanode03 | RedHat 6.7 | Hadoop Data | 256GB | 18 x 2TB | N/A | 36TB | 431 W | 1470.6 btu/h | 750 W | 2559.1 btu/h
Datanode02 | RedHat 6.7 | Hadoop Data | 256GB | 18 x 2TB | N/A | 36TB | 431 W | 1470.6 btu/h | 750 W | 2559.1 btu/h
Datanode01 | RedHat 6.7 | Hadoop Data | 256GB | 18 x 2TB | N/A | 36TB | 431 W | 1470.6 btu/h | 750 W | 2559.1 btu/h
EqualLogic PS4210E | Windows Storage | Network Storage | | 12 x 6TB | 6 | 48TB | 327.2 W | 116.3 btu/h | 700 W | 2446.1 btu/h
TABLE 7.1.5-2 POWER CONSUMPTION (CONTINUED)
System Name | Maximum Potential Power (Watts) | Maximum Potential Power (BTU/h) | Input Current (A) | Sound Power Level (bels) | Airflow Rate (CFM) | Airflow Rate (l/s) | Air Temp Rise (°C) | Air Temp Rise (°F) | Weight (lbs) | Weight (Kg)
FS1048D-01 | 420 W | N/A | 1.2 A | N/A | N/A | N/A | N/A | N/A | 19.40 lbs | 8.8 Kg
FS1048D-02 | 420 W | N/A | 1.2 A | N/A | N/A | N/A | N/A | N/A | 19.40 lbs | 8.8 Kg
FS448D-01 | N/A | N/A | 1.34 A | N/A | N/A | N/A | N/A | N/A | 8.2 lbs | 3.7 Kg
FS448D-02 | N/A | N/A | 1.34 A | N/A | N/A | N/A | N/A | N/A | 8.2 lbs | 3.7 Kg
Master Node 1 | 594.1 W | 2027 btu/h | 1.7 A | 6.1 bels | 27.5 CFM | 13 l/s | 24.1 °C | 43.4 °F | 71.7 lbs | 32.5 Kg
Master Node 2 | 594.1 W | 2027 btu/h | 1.7 A | 6.1 bels | 27.5 CFM | 13 l/s | 24.1 °C | 43.4 °F | 71.7 lbs | 32.5 Kg
RedHat Node 1 | 630.2 W | 2150.3 btu/h | 1.8 A | 6.2 bels | 28.7 CFM | 13.6 l/s | 24.6 °C | 44.2 °F | 71.7 lbs | 32.5 Kg
RedHat Node 2 | 630.2 W | 2150.3 btu/h | 1.8 A | 6.2 bels | 28.7 CFM | 13.6 l/s | 24.6 °C | 44.2 °F | 71.7 lbs | 32.5 Kg
RedHat Node 3 | 630.2 W | 2150.3 btu/h | 1.8 A | 6.2 bels | 28.7 CFM | 13.6 l/s | 24.6 °C | 44.2 °F | 71.7 lbs | 32.5 Kg
Datanode04 | 658.8 W | 2248 btu/h | 2 A | 6.5 bels | 25 CFM | 11.8 l/s | 31 °C | 55.9 °F | 65 lbs | 29.5 Kg
Datanode03 | 658.8 W | 2249 btu/h | 2 A | 6.5 bels | 25 CFM | 11.8 l/s | 32 °C | 55.9 °F | 65 lbs | 29.5 Kg
Datanode02 | 658.8 W | 2250 btu/h | 2 A | 6.5 bels | 25 CFM | 11.8 l/s | 33 °C | 55.9 °F | 65 lbs | 29.5 Kg
Datanode01 | 658.8 W | 2251 btu/h | 2 A | 6.5 bels | 25 CFM | 11.8 l/s | 34 °C | 55.9 °F | 65 lbs | 29.5 Kg
EqualLogic PS4210E | 408.4 W | 1393.5 btu/h | 1.5 A | 5.4 bels | 19.3 CFM | 9.1 l/s | 22.8 °C | 41 °F | 60.84 lbs | 27.6 Kg
7.2 SECURITY DETAILED DESIGN
This section describes the Security Detailed Design. As noted in section 4, security for the IMPALA platform is designed to protect the data at five (5) discrete layers: Perimeter Security, Authentication, Authorization, Encryption, and Policy. Within the IMPALA platform, these five layers are represented as modules; Table 7.2-1 below maps the technical security layers to the IMPALA security modules.
TABLE 7.2-1 PROTECTION POINT TO IMPALA MODULE MAPPING
Protection Layer | IMPALA Security Module | IMPALA Security Sub Modules | MEME Support
Perimeter | N/A | N/A | MEME-Firewall, MEME-VPN, SSL Certificate
Authentication | IMPALA Authentication | Kerberos | Active Directory, Launchpad
Authorization | IMPALA RBAC | Sentry, Kerberos | Active Directory, NAMS
Encryption | IMPALA Encryption | Navigator Encrypt, Key Management Server (KMS), Key Trustee Server (KTS), Hardware Security Module (HSM) |
Policy | IMPALA Auditing | Navigator Audit, Navigator Lineage, Waterline | CIS Benchmark, Vulnerability Management (MVM), Anti-virus and Application/Memory protection tools (TrendMicro & EMET), Log Aggregation (Splunk), Inventory/System Management (Dell KACE)
The following diagram depicts how the IMPALA security Modules fit in the MEME Network.
Figure 7.2-1 IMPALA Security Overview
The sections that follow address each of the security components and describe how they communicate with the data reservoir.
7.2.1 Perimeter Security
Figure 7.2.1-1 Perimeter Security
The IMPALA platform relies on the MEME environment and the JSC network to provide perimeter
security. The MEME firewall prevents unauthorized access from outside to services in the MEME
environment. Requests for access to the IMPALA environment will come through the NASA Access
Management System (NAMS). Requests will be forwarded to the IMPALA administrators to ensure that
the user is provisioned in the correct groups and roles as defined in the data governance guide.
End users leverage the JSC NDC Active Directory domain for authentication. An end user authenticates using a PIV card through the NASA Launchpad interface. Before authenticating, the user must first establish a Virtual Private Network (VPN) connection to the MEME environment to gain access to the IMPALA platform portal page (Figure 4.5-2 depicts this process).
7.2.2 IMPALA Authentication Module
The IMPALA authentication module performs authentication at the service/component level and at the
end user level. It is also the integration point between the NASA Active directory domain and the
IMPALA platform. Within the IMPALA platform, there are three types of accounts: user, component
and service.
The user accounts are the end-user accounts that authenticate into the IMPALA platform and utilize its services. User accounts must first be registered in the NASA Active Directory domain (NDC). The NASA environment is transitioning to the use of PIV cards, which enable certificate-based two-factor authentication against the NDC. These users access the IMPALA platform components through web interfaces. The IMPALA web interfaces will be fronted by the NASA Launchpad application to ensure that users can authenticate using their PIV cards. Finally, user profiles are created within the IMPALA components during the user registration process. These profiles enable fine-grained privileges within each component. These privileges are defined by roles within each component; section 7.3.2 highlights these component roles.
Component accounts are accounts, or principals, defined within the IMPALA Key Distribution Center (KDC). The KDC is described in detail later in this section. These accounts are created when the components are installed and are local to the IMPALA platform. Components provide access to functionality within the IMPALA platform and serve as the only means of access to the IMPALA data reservoir. Every component authenticates against the IMPALA KDC to identify itself as an authorized user to execute jobs within the IMPALA data reservoir.
The service accounts are accounts created automatically during the deployment of the IMPALA data
reservoir. These accounts are also defined within the KDC to ensure that each service must first
authenticate before performing actions like assigning resources to execute jobs within the IMPALA data
reservoir.
The IMPALA data reservoir is configured to work with a Kerberos Key Distribution Center (KDC). The
KDC serves as the internal domain controller for the IMPALA data reservoir. All services and
components within the IMPALA data reservoir (Enterprise Data Hub - EDH) are registered within the
KDC. The KDC enables all component and service accounts to authenticate within the IMPALA data
reservoir using generated keytabs (keytabs are described below).
A one-way, cross-realm trust is set up between the NASA NDC domain and the IMPALA data reservoir KDC. This trust ensures that only user accounts that have been registered within the NDC can access services offered in the IMPALA data reservoir through the IMPALA components. Once an end user authenticates into an IMPALA component and the profile is validated within the component, the component account serves as a delegate for the user into the IMPALA data reservoir.
Figure 7.2.2-1 IMPALA Authentication Flow Diagram
Figure 7.2.2-1 above shows how Kerberos works with the perimeter security discussed earlier to provide authentication for all users and services in the IMPALA platform. Before delving deeper into the IMPALA authentication module, we introduce some terminology in Table 7.2.2-1 below.
TABLE 7.2.2-1 IMPALA AUTHENTICATION SECURITY MODULE TERMINOLOGIES
Terminology | Description
Keytab | A keytab is a file containing pairs of Kerberos principals and encrypted keys that are derived from the Kerberos password. IMPALA applications use keytabs to authenticate to the IMPALA data reservoir services without requiring human interaction. The keytabs are protected and only allow access to services; they never allow direct access to data.
Kerberos Principal | A Kerberos principal is a unique identity to which Kerberos can assign tickets. Traditionally, a principal is divided into three parts: the primary, the instance, and the realm. The format of a typical Kerberos V5 principal is primary/instance@REALM. Service principals are used for accessing services (including servers) in the IMPALA platform, and user principals are used for user access.
Primary | The primary is the first part of the Kerberos principal. In the case of a user, it is the same as the username. For a host, the primary is the word "host".
Instance | The instance is an optional string that qualifies the primary. The instance is separated from the primary by a slash (/). In the case of a user, the instance is usually null, but a user might also have an additional principal with an instance called "admin", which is used for administration. For example, the principal [email protected] is completely separate from the principal jennifer/[email protected], with a separate password and separate permissions. In the case of a host, the instance is the fully qualified hostname, e.g., host/server.ndc.nasa.gov.
Realm | The realm is the Kerberos realm, which, in most cases, is the domain name in upper-case letters. For example, the machine server.ndc.nasa.gov would be in the realm "NDC.NASA.GOV".
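As a small illustration of the principal format defined in the table above (primary/instance@REALM), the helper below splits a principal string into its three parts; it is a generic sketch, not an IMPALA utility.

    def parse_principal(principal: str):
        """Split a Kerberos V5 principal into (primary, instance, realm)."""
        name, _, realm = principal.partition("@")
        primary, _, instance = name.partition("/")
        return primary, instance or None, realm or None

    print(parse_principal("[email protected]"))                      # ('jennifer', None, 'NDC.NASA.GOV')
    print(parse_principal("jennifer/[email protected]"))                # ('jennifer', 'admin', 'NDC.NASA.GOV')
    print(parse_principal("host/[email protected]"))  # host principal example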
The IMPALA authentication module is the foundational building block for IMPALA security; all other submodules leverage it to function, and it is the key control point for access to the other components.
Kerberos internally uses a ticketing system that employs security keys. To achieve this, it has a Key Distribution Center (KDC) that is responsible for managing and distributing credentials. To avoid downtime, Kerberos is deployed in a highly available mode by configuring a failover Slave KDC. The following points detail how Kerberos is configured and how it works:
- Both the Master and Slave KDC servers define a distinct Kerberos realm and are deployed internally to the IMPALA platform.
- All servers within the IMPALA platform are configured with both the Kerberos realm (local to IMPALA) and the NASA AD domain. The default realm is set to be the local IMPALA Kerberos realm. For example, the Administrator could choose the local realm to be IMPALA.NDC.NASA.GOV and the NASA AD could have a realm of NDC.NASA.GOV.
- Service principals (the service accounts mentioned above) are created in the local IMPALA KDC under the IMPALA Kerberos realm.
- A one-way, cross-realm trust must be set up from the IMPALA Kerberos realm to the central NASA AD domain containing the user principals (users) that require access to the data reservoir. As a result, there is no need to create user principals in the local realm.
With the Modular Authentication design discussed above, Kerberos authentication can be configured
using the local IMPALA KDC independently of integrating with Active Directory. An advantage of this is
that the local IMPALA KDC serves as a shield for the central NASA Active Directory from the many hosts
and services in the IMPALA system since all service requests are granted locally instead of going out to
the NASA AD on each call.
It should be noted that, without the shielding described above, service restarts in a large cluster generate many simultaneous authentication requests, which would impact the NASA AD service.
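A minimal sketch of how a component or service account authenticates non-interactively with its keytab, as described above; the keytab path and principal below use the example local realm IMPALA.NDC.NASA.GOV from the text and are placeholders rather than actual IMPALA values.

    import subprocess

    KEYTAB = "/etc/security/keytabs/trifacta.keytab"                       # placeholder keytab path
    PRINCIPAL = "trifacta/[email protected]"    # placeholder service principal

    # Obtain a Kerberos ticket-granting ticket from the local IMPALA KDC using
    # the keytab, with no human interaction; this is how component accounts
    # identify themselves before executing jobs against the data reservoir.
    subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

    # Show the cached ticket for verification.
    subprocess.run(["klist"], check=True)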
7.2.3 IMPALA RBAC Module
In IMPALA, there are two classes of roles: component (or application) roles and data roles.
The component (or application) roles are predefined within the applications. These roles encapsulate
the privileges of a user within the component. Table 7.2.3-1 is a listing of the IMPALA components that
have defined roles and the privileges of those roles. Users are assigned to one of these roles when
registered within each application/component.
TABLE 7.2.3-1 IMPALA COMPONENT ROLES AND PRIVILEGES
Component Tool | Role | Function
Alpine | Administrator | Set up and assign users to profiles. Establish/create data source connections.
Alpine | Collaborator | View or execute created workflows. Create comments.
Alpine | Developer | Create workflows. Share out workspaces.
Trifacta | Data Admin | Create user profiles.
Trifacta | Wrangler | Create transformation projects.
Waterline Data | Administrator | Create user profiles.
Waterline Data | Data Steward | Create tagging profiles.
Waterline Data | Annotator | Tag associations: create, approve, and reject tag associations (any tag domain). Origins: create, update, and remove origins. Lineage relationships: create, approve, and reject lineage relationships. Collections: create, approve, and remove collections.
Waterline Data | End User | Data: view authorized data. Metadata: view authorized metadata. Hive tables: create Hive tables.
Centrifuge | Administrator | Create user profiles.
Centrifuge | Developer | Create exploration projects and visual dashboards.
Pentaho Data Integrator | Administrator | Administer security. Read content. Execute jobs. Create content.
Pentaho Data Integrator | Power User | Read content. Execute jobs. Create content.
IMPALA Portal | Administrator | Administer security: user profiles, roles, etc.
IMPALA Portal | User | Create, edit, and view projects for analytics, search, ETL development, etc.
The data roles govern authorization to data within the IMPALA data reservoir using a method of regulating access known as Role-Based Access Control (RBAC). In this context, access is the ability of an individual user to perform a specific task against the data within the IMPALA data reservoir, such as viewing, creating, or modifying a file. Apache Sentry and Record Service are the components used to enable RBAC within the IMPALA platform. Although Record Service is still in a beta release, the IMPALA team has decided to use it because it best aligns with the requirements.
Figure 7.2.3-1 IMPALA Authorization Flow Diagram
Apache Sentry and Record Service provide role-based authorization models for the IMPALA data reservoir, offering fine-grained access to data accessed through a schema. Both provide the ability to control and enforce precise levels of privileges on data for authenticated users and IMPALA components (or applications), and both allow the definition of authorization rules to validate access requests for resources within the IMPALA data reservoir. The following sections discuss Apache Sentry and Record Service in more detail.
Apache Sentry maps users and groups defined within the NDC Active Directory, as well as component principals defined within the KDC, to roles within the IMPALA data reservoir. IMPALA administrators, based on guidance from the data governance board, create the roles within the IMPALA data reservoir and map these roles to the NDC Active Directory groups and component principals. Privileges to read, write, or both are granted to the individual roles for the data entities within the IMPALA platform via policy files or directly using GRANT options within the Sentry Service.
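For illustration, the statements below sketch how an administrator might define a Sentry role and grant it privileges through the SQL GRANT options mentioned above (executed through Hive or Impala, for example with beeline or impala-shell); the role, group, and database names are hypothetical examples, not the roles defined by the data governance board.

    # Hypothetical Sentry role setup, shown as the SQL an administrator would
    # run through Hive/Impala; the names are illustrative only.
    SENTRY_SETUP_SQL = [
        "CREATE ROLE analyst_role",
        # Map an NDC Active Directory group to the role.
        "GRANT ROLE analyst_role TO GROUP impala_analysts",
        # Allow read access to one database and full access to a working area.
        "GRANT SELECT ON DATABASE lsah TO ROLE analyst_role",
        "GRANT ALL ON DATABASE analyst_workspace TO ROLE analyst_role",
    ]

    for statement in SENTRY_SETUP_SQL:
        print(statement + ";")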
Table 7.2.3-2 describes the Sentry components.
TABLE 7.2.3-2 IMPALA SENTRY COMPONENTS AND FUNCTIONS
Sentry Component | Description
Sentry Server | The Sentry Remote Procedure Call (RPC) server manages the authorization metadata. It supports interfaces to securely retrieve and manipulate the metadata.
Data Engine | The data engine loads the Sentry plugin. It intercepts all requests for accessing data and routes them to the Sentry plugin for validation.
Sentry Plugin | The Sentry plugin runs in the data engine. It offers interfaces to manipulate authorization metadata stored in the Sentry server, and includes the authorization policy engine that evaluates access requests using the authorization metadata retrieved from the server.
With Sentry, the IMPALA platform gains the following RBAC features:
Secure authorization: Sentry provides the ability to control and enforce access to data, and privileges on data, for authenticated users.
Fine-grained access control: Sentry supports fine-grained access control to data and metadata. It allows access control at the server, database, table, and view scopes at different privilege levels, including SELECT, INSERT, and ALL, which lets administrators use views to restrict access to specific columns or rows. Administrators can also mask data within a file as required by combining Sentry with views that use CASE statements or User-Defined Functions (UDFs); see the sketch following this list.
Role-based authorization: Sentry supports ease of administration through role-based authorization; multiple groups can be granted access to the same data at different privilege levels. For example, for a particular data set, the HHPIT security team may be given rights to view all columns, while MEME administrators may view only non-sensitive or non-PII (personally identifiable information) columns. These rights also govern the ingest processing pipeline, through the user roles used to insert new data into HDFS.
Multi-tenant administration: Sentry allows permissions on different data sets to be delegated to different administrators. In the case of Hive/Apache Impala, Sentry allows administration of privileges at the level of a database/schema.
Unified platform: Sentry provides a unified platform for securing data; it uses the existing Hadoop Kerberos security for authentication, and the same Sentry policy is enforced whether data is accessed through Hive or Impala.
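The following sketch illustrates the view-based column restriction and masking described under fine-grained access control, reusing the hypothetical impyla cursor (cur) from the earlier sketch. The table, column, view, and role names are assumptions for illustration only.

    # Sketch: exposing a masked view instead of the base table. All object
    # names are hypothetical; 'cur' is the impyla cursor from the earlier sketch.
    cur.execute("""
        CREATE VIEW lsah.crew_labs_masked AS
        SELECT
            exam_id,
            exam_date,
            CASE WHEN subject_ssn IS NOT NULL
                 THEN 'XXX-XX-XXXX'          -- mask the PII column
                 ELSE NULL
            END AS subject_ssn,
            analyte_value
        FROM lsah.crew_labs
    """)

    # MEME administrators receive only the masked view; a role for the HHPIT
    # security team would instead be granted SELECT on the base table.
    cur.execute("GRANT SELECT ON TABLE lsah.crew_labs_masked TO ROLE meme_admins")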
Record Service maps users and groups defined within the NDC active directory, as well as component
principals defined within the KDC, to roles within the IMPALA data reservoir by leveraging the Apache
Sentry service. IMPALA administrators, based on guidance from the data governance board, create the
roles within the IMPALA data reservoir and map these roles to the NDC active directory groups and
component principals. Privileges to read, write or both are granted to the individual roles for the data
entities within the IMPALA platform using GRANT options within the Sentry Service.
Table 7.2.3-3 describes the Record Service components.
TABLE 7.2.3-3 RECORD SERVICE COMPONENTS AND FUNCTIONS
Record Service Component – Description
RecordServicePlanner – The Record Service planner generates tasks, performs authorization checks, and handles metadata access.
RecordServiceWorker – The Record Service worker executes tasks, reads and writes to the IMPALA storage layer, and returns rows in a canonical format.
Thrift APIs – The Thrift APIs allow connection to the two Record Service Thrift services mentioned above, the RecordServicePlanner and the RecordServiceWorker.
Client Integration Libraries – Client integration libraries that ease migration to Record Service.
Using Record Service with Sentry provides these key benefits:
Fine-grained security enforcement: Record Service enforces column-level permissions (projections), row-level permissions (filtering), and data masking across the IMPALA reservoir components.
Performance: Record Service sits on the main data access path, which means it must process every byte of data. It scales horizontally to run on the largest Hadoop clusters with high efficiency. It uses the Apache Impala I/O layer, which applies low-level optimizations such as HDFS short-circuit reads and dynamic code generation to improve throughput and reduce CPU utilization. Record Service brings these performance benefits to the other components in Hadoop and accelerates their performance, despite adding a new layer to the stack.
Simplicity: Record Service provides a higher-level, logical abstraction for data. Datasets can be specified by logical name (i.e., tables or views), and Record Service returns schema-aware records, in contrast to the storage APIs, which deal with paths and bytes. Applications built on top of the Record Service APIs therefore do not need to handle differences in file formats, the underlying storage APIs, or other low-level details.
Caveats of using Record Service:
Beta release: Record Service is currently a beta release. However, the advantages of using it outweigh the risks involved. Record Service has substantial community backing and is expected to be production ready within a few months.
7.2.4 IMPALA Encryption Module
The IMPALA Encryption Module is leveraged to encrypt data at rest within the IMPALA platform using AES-256 encryption. This ensures that malicious access to the underlying storage does not provide the ability to read PII.
Figure 7.2.4-1 IMPALA Encryption Flow Diagram
The encryption module has four (4) subcomponents: Navigator Encrypt, the Key Management Server (KMS), the Key Trustee Server (KTS), and the Key Hardware Security Module (Key HSM).
TABLE 7.2.4-1 IMPALA ENCRYPTION COMPONENTS AND FUNCTIONS
Component – Function
Navigator Encrypt – A client-side service that transparently encrypts data at rest without requiring changes to applications and with minimal performance impact on encryption and decryption. Advanced key management with Key Trustee Server and process-based access controls in Navigator Encrypt enable organizations to meet compliance regulations and ensure that unauthorized parties or malicious actors never gain access to encrypted data.
Key Management Server – A customized server that uses the Key Trustee Server for robust and scalable encryption key storage and management.
Key Trustee Server – An enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys. With Key Trustee Server, encryption keys are separated from the encrypted data, ensuring that sensitive data is protected in the event that unauthorized users gain access to the storage media.
Key Hardware Security Module (Key HSM) – A service that allows Key Trustee Server to integrate with a hardware security module (HSM). Key HSM enables Key Trustee Server to use an HSM as the root of trust for cryptographic keys, taking advantage of Key Trustee Server's policy-based key and security asset management capabilities while satisfying existing internal security requirements regarding the treatment of cryptographic materials.
Within the IMPALA platform, the KMS and the KTS are configured for high availability. Navigator Encrypt, depicted above, works with the Key Management Server and Key Trustee Server to manage the encryption keys transparently. For key administration, we recommend designating one administrator for the data and a separate security officer or administrator for the encryption keys, to prevent unauthorized access by super users.
Figure 7.2.4-2 Integration between the IMPALA Data Reservoir and the IMPALA Encryption
components
All encryption within the IMPALA data reservoir is performed within encryption zones. An encryption zone is an HDFS directory whose entire contents, that is, every file and subdirectory in it, are encrypted. Files in this directory are transparently encrypted on write and transparently decrypted on read.
Each encryption zone (ez) is associated with a key (ez key), which is specified when the zone is created. Each file within an encryption zone also has its own encryption/decryption key, called the Data Encryption Key (DEK). DEKs are never stored persistently except in encrypted form: a DEK encrypted with the encryption zone's key is known as an Encrypted Data Encryption Key (EDEK). The EDEK is stored persistently as part of the file's metadata on the NameNode, which is a master node of the IMPALA reservoir. Because the EDEK is encrypted, it has no value to eavesdroppers.
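A minimal sketch of how an encryption zone could be provisioned follows, assuming the stock Hadoop key and hdfs crypto command-line tools are available on the path and the cluster key provider is the IMPALA KMS; the key name and HDFS path are hypothetical.

    # Sketch: creating an encryption zone key and an encryption zone.
    # The key name and HDFS path are hypothetical; the commands are the
    # standard Hadoop/HDFS CLIs wrapped in subprocess calls.
    import subprocess

    def run(cmd):
        """Run a command, echo it, and fail loudly on a non-zero exit code."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Create the encryption zone key; the key material lives in the KMS/KTS,
    #    never in HDFS.
    run(["hadoop", "key", "create", "lsah_zone_key"])

    # 2. Create an empty directory and declare it an encryption zone bound to
    #    that key. Files written under it get per-file DEKs, and only the
    #    EDEKs are persisted in the NameNode metadata.
    run(["hdfs", "dfs", "-mkdir", "-p", "/data/lsah"])
    run(["hdfs", "crypto", "-createZone", "-keyName", "lsah_zone_key",
         "-path", "/data/lsah"])

    # 3. Confirm the zone exists.
    run(["hdfs", "crypto", "-listZones"])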
7.2.5 IMPALA Auditing
The IMPALA Auditing Module reports on where data came from and how it is used. Internally the module is composed of Navigator Audit and Navigator Lineage. The module also leverages Waterline for end-to-end data lineage.
7.2.5.1 Navigator Audit
The Navigator Audit configuration enables collection and filtering of audit events as they are added to the IMPALA data reservoir. This is done using plug-ins provided as part of the IMPALA data reservoir. The plug-ins write the audit events to an audit log on the local filesystem.
The IMPALA data reservoir audit components monitor the audit log files on the local filesystem and send these events to the Navigator Audit Server. Once the audit events are written to the audit log file, they are guaranteed to be delivered to the Navigator Audit Server as long as the filesystem is available. The audit components keep track of the current audit event offset in the audit log that has been successfully transmitted, so after any crash or restart they pick up from the last successfully sent position and resume. This ensures that all audit events in the IMPALA data reservoir are persisted.
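The conceptual sketch below illustrates this offset-tracking behavior; it is not the Navigator shipping component itself, and the file paths and the send_to_audit_server function are hypothetical stand-ins for the real transport.

    # Conceptual sketch of at-least-once audit log shipping: persist the byte
    # offset of the last event successfully sent and resume from it on restart.
    import os

    AUDIT_LOG = "/var/log/impala-audit/audit.log"   # hypothetical path
    OFFSET_FILE = "/var/lib/impala-audit/offset"    # hypothetical path

    def send_to_audit_server(event: str) -> None:
        """Hypothetical stand-in for delivering one event to the audit server."""
        print("shipped:", event.rstrip())

    def load_offset() -> int:
        """Return the last persisted offset, or 0 on the first run."""
        try:
            with open(OFFSET_FILE) as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def save_offset(offset: int) -> None:
        """Persist the offset after each successfully shipped event."""
        os.makedirs(os.path.dirname(OFFSET_FILE), exist_ok=True)
        with open(OFFSET_FILE, "w") as f:
            f.write(str(offset))

    def ship_new_events() -> None:
        offset = load_offset()
        with open(AUDIT_LOG, "rb") as log:
            log.seek(offset)
            while True:
                line = log.readline()
                if not line:
                    break  # reached end of file; wait for the next run
                send_to_audit_server(line.decode("utf-8", errors="replace"))
                save_offset(log.tell())

    if __name__ == "__main__":
        ship_new_events()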
7.2.5.2 Navigator Lineage
Navigator Lineage provides automatic collection and easy visualization of upstream and downstream data lineage to and from the IMPALA data reservoir. For each data set in the IMPALA data reservoir, it shows, down to the column level within that data set, the precise upstream data sources, the transforms performed to produce the data set, and the impact that data has on downstream artifacts.
A lineage diagram is a directed graph that depicts an entity and its relationships to other entities.
With both auditing and lineage, an IMPALA security officer is able to see who had access to a dataset and what queries or modifications they ran against it. The IMPALA security officer is also able to detect any unauthorized access to the IMPALA data reservoir datasets and take action.
To summarize, the IMPALA audit module is fundamental to understanding where a dataset came from, which users touched it, what security conditions applied to it, and whether those conditions are still being met where the data ultimately rests or resides. Navigator Audit and Navigator Lineage pull and consolidate all audit logs, making it easy to export them to Syslog for backup.
7.2.5.3 Waterline
The Waterline tool is also used to audit the data and provide lineage, complementing Navigator Audit and Navigator Lineage. IMPALA uses Waterline extensively for data governance; it is only mentioned here. See the Data Governance document for the other functions the tool provides in addition to complementing Navigator Audit and Navigator Lineage.
7.2.5.4 Server Security Auditing
Servers within the IMPALA platform are hardened using the CIS Benchmark tool provided by the HHPIT
security team. Recommendations produced by the CIS benchmarking tool are addressed and a report is
generated for auditing purposes.
7.2.5.5 Anti-Virus Monitoring
Anti-virus agents provided by the MEME Infrastructure team are deployed on servers within the
IMPALA platform. These agents generate and publish reports to the MEME anti-virus server for
auditing.
8.0 INTERFACES
8.1 INBOUND INTERFACES
The following are all the inbound interfaces to the IMPALA Platform Solution.
Electronic Medical Records (EMR): Clinical, laboratory, and medical requirements data
o Epidemiologic data is ingested from NASA internal data sources such as case
investigation files, supplemental forms, clinical data and crewmember electronic
medical records (EMR system).
Lifetime Surveillance of Astronaut Health (LSAH): clinical, laboratory, medical requirements
data and various forms of data related to space flight
o Laboratory data such as test orders and lab results for crew members is ingested from various internal sources such as the Lifetime Surveillance of Astronaut Health (LSAH) system and existing laboratory systems.
MEDB SharePoint: medical requirements summary data stored in SharePoint
o Environmental data from internal and external investigations, operations, studies, researchers, experiments, and management activities.
o Medical data collected for a mission program (Med B) from a mission SharePoint site. This data is primarily sourced from SharePoint lists and document libraries (PDF, Word, and Excel documents).
Other future inbound interfaces include Ad-hoc Datasets, VIIP, Cardio, and CO2.
8.2 OUTBOUND INTERFACES
The following are some of the outbound interfaces from the IMPALA Platform Solution.
Exams and Reports for incorporation into the MASH Report
o One operational use of the IMPALA system is to locate the exams and reports produced
through clinical and lab activities. At the end of a mission, these exams and reports are
located and summarized for presentation to the crewmember and flight surgeon.
Initially, this interface will present these files to the analysts and flight surgeon for offline communication with the crew member. In the future, if crew members are granted view access into IMPALA, these exams and reports could be made available to them through this interface as well.
Flight Surgeon Dashboard
o Before, during, and after a mission, clinical and lab testing must be done to fulfill MEDB
requirements. The flight surgeon dashboard is an interface where the flight surgeon can
track the status of reports and exams in the IMPALA platform, as they relate to a given
crew member’s mission and MEDB requirements.
Data Catalog
o The Data Catalog is a listing of all fields in the database along with their metadata.
Users can use a web browser interface to view the Data Catalog and explore the data
that exists in the data reservoir.
Other future outbound interfaces include
o Historical and Trend Data
o Reference and Metadata
o Audits and Exception Data
Appendix A Acronyms and Abbreviations
°C – degrees Celsius
°F – degrees Fahrenheit
A, AMP – Ampere
AD – Active Directory
AES – Advanced Encryption Standard
BIOS – basic input/output system
BTU – British Thermal Unit
Cardio – Cardio Ox (Defining the Relationship Between Biomarkers of Oxidative and Inflammatory Stress and the Risk for Atherosclerosis in Astronauts During and After Long-duration Spaceflight)
CB – Control Board
CDR – Critical Design Review
CFM – Cubic Feet per Minute
CIS – Center for Internet Security
CO2 – carbon dioxide
COTS – Commercial-off-the-shelf
CPU – Central Processing Unit
CSV – comma separated value
DA – Dual Adapter
dB – decibel
DDR – Detailed Design Review
DEK – Data Encryption Key
DIMM – Dual In-line Memory Module
DNS – Domain Name Server
DOC – Microsoft Word document
DP – Dual Port
DVD – Digital Video Disk
EDEK – Encrypted Data Encryption Key
EDH – Enterprise Data Hub
EMET – Enhanced Mitigation Experience Toolkit
EMR – Electronic Medical Record
ETL – Extract, Transform, Load
ez – encryption zone
FS – Fortinet Switch
ft – feet
Gb – Gigabit
GB – Gigabyte
Gbe – Gigabit Ethernet
Gbps – Gigabits per second
GOV – Government
GT/s – Gigatransfers per second
GUI – Graphical User Interface
h – hour
HA – High Availability
HDD – Hard Disc Drive
HDFS – Hadoop Distributed File System
HHP – Human Health and Performance
HHPC – Health and Human Performance Contract
HHPD – Human Health and Performance Directorate
HSM – Hardware Security Module
HT – Hyper Threaded
HW – Hardware
ICMP – Internet Control Message Protocol
iDRAC – Integrated Dell Remote Access Controller
IMPALA – Information Management Platform for Data Analytics and Aggregation
in. – Inch(es)
IPTA – Initial Privacy Threshold Analysis
iSCSI – Internet Small Computer System Interface
IRD – Information Resource Directorate
IT – Information Technology
ITAR – International Traffic in Arms Regulations
JBOD – just a bunch of drives
JDBC – Java Database Connectivity
JPEG, JPG – Joint Photographic Experts Group
JSC – Lyndon B. Johnson Space Center
JSON – JavaScript Object Notation
KDC – Key Distribution Center
kg – kilogram
KMS – Key Management Server
KTS – Key Trustee Server
l/s – liters per second
lbs – pounds
LSAH – Lifetime Surveillance of Astronaut Health
LSDA – Life Science Data Archive
m – meter
MASH – Mission Associated Summary of Health
max – maximum
MEDB – Medical Evaluation Document, Part B
MEME – Mission Extended Medical Enterprise
MHz – Megahertz
MLC – Multi-Level Cell
MS – Microsoft
MVM – McAfee Vulnerability Manager
N/A – Not Applicable
NACB – Network Access Control Bar
NAMS – NASA Access Management System
NAS – Network Attached Storage
NASA – National Aeronautics and Space Administration
NBD – Next Business Day
NDC – NASA Data Center
NEMA – National Electrical Manufacturers Association
NIC – Network Interface Card
NICS – Network Interface Cards
NLSAS – Near-Line Serial Attached SCSI
NV – Non-Volatile
ODBC – Open Database Connectivity
OOB – Out-of-Band
OS – operating system
PCIe – Peripheral Component Interconnect Express
PCN – Page Change Notice
PDF – Portable Document Format
PDI – Pentaho Data Integrator
PDR – Preliminary Design Review
PERC – PowerEdge Expandable RAID Controller
PIA – Privacy Impact Assessment
PII – Personally Identifiable Information
PIV – Personal Identity Verification
PNG – Portable Network Graphic
QPI – QuickPath Interconnect
RAID – Redundant Array of Independent Disks
RAM – Random Access Memory
RBAC – Role-Based Access Control
RDBMS – Relational Database Management System (Oracle and Microsoft SQL Server)
RDIMM – Registered Dual In-Line Memory Module
RPC – Remote Procedure Call
RPM – Red Hat Package Manager
SAS – Serial Attached SCSI
SDD – System Design Document
SF – Designator for
SF5 – Designator for the Information Systems Architecture Branch
SFP – Small Form-factor Pluggable
SIEM – Security Information and Event Management
SME – Subject Matter Expert
SNMP – Simple Network Management Protocol
SOW – Statement of Work
SQL – Structured Query Language
SSD – Solid State Drive
SSH – Secure Shell
SSL – Secure Sockets Layer
STRAW – System for Tracking and Registering Applications and Websites
SW – Software
TB – Terabyte
TLS – Transport Layer Security
TOR – Top-of-Rack
TRB – Technical Review Board
UDF – User-Defined Function
UDP – User Datagram Protocol
UI – User Interface
VIIP – Vision Impairment and Intracranial Pressure
VM – Virtual Machine
VMDK – Virtual Machine Disk
VPN – Virtual Private Network
W – Watts
XML – eXtensible Markup Language
Yr – Year