This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Commit

Support for loading data from Greenplum databases (#692)
* Integration with Greenplum: WIP
* Script supporting greenplum data dumping
* Python scripts working with either Python2 or Python3
* Document federated database use
* Remove temporary files after loading
Mihai Budiu authored Sep 11, 2020
1 parent 2337324 commit f472529
Showing 72 changed files with 1,098 additions and 392 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -31,6 +31,7 @@ hillview.redo
*.log.lck
*.log.lck.*
*.pid
*.pyc
out
apache-tomcat-*
web/classes
@@ -70,6 +71,7 @@ web/src/main/webapp/dist
hs_err_pid*

tmp
repository/*.jar

# data which is too big to put into git
data/ontime/On_Time_On_Time*
36 changes: 19 additions & 17 deletions README.md
@@ -1,8 +1,8 @@
![Hillview project logo](hillview-logo.png)

Hillview: a big data spreadsheet. Hillview is a cloud-based
service for interactively visualizing large datasets.
The Hillview user interface executes in a browser.

Contents:

@@ -115,16 +115,18 @@ Hillview uses `ssh` to deploy code on the cluster. Prior to
deployment you must set up `ssh` on the cluster to use password-less
access to the cluster machines, as described here:
https://www.ssh.com/ssh/copy-id. You must also install Java on all
machines in the cluster. Each machine in the cluster must allow
connections on the network ports described in the [configuration
file](#service-configuration).

*Please note that Hillview allows arbitrary access to files on the
worker nodes from the client application running with the privileges
of the user specified in the configuration file.*

## 3.1 Service configuration

The configuration of the Hillview service is described in a JSON file
(enhanced with comments); two sample files are `bin/config.json` and
`bin/config-local.json`. The file `config-local.json` treats the local
machine as a one-machine cluster.
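
As a rough illustration of reading a JSON file "enhanced with comments", the sketch below drops whole-line comments before parsing. It assumes comments start with `//` and occupy their own line; the actual comment syntax and the key names used in `bin/config.json` are not shown here, so `webserver` and `workers` are hypothetical.

```python
import json


def load_commented_json(text):
    """Parse JSON text in which whole lines may be // comments (assumed syntax)."""
    # Keep only the lines whose first non-blank characters are not "//".
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


# Hypothetical cluster description; the key names are illustrative only.
sample = """
// The machine that runs the web front-end
{
  "webserver": "web.example.com",
  // The machines that hold the data
  "workers": ["worker1.example.com", "worker2.example.com"]
}
"""

config = load_commented_json(sample)
print(config["webserver"], len(config["workers"]))
```

This approach would not handle a comment that trails a value on the same line; a real loader would need a more careful scanner for that case.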

@@ -228,7 +230,7 @@ They are described [here](bin/README.md).
# 4. Developing Hillview

We only provide development instructions for Linux or MacOS, but there is
no reason Hillview could not be developed on Windows.

## 4.1. Software Dependencies

@@ -307,7 +309,7 @@ Subsequent builds can just run
$ bin/rebuild.sh
```

Hillview is currently split into two separate Maven projects. One can
also build the two projects separately, as follows:

* platform: pure Java, includes the entire back-end. This produces a
@@ -342,7 +344,7 @@ standard.
Download and install IntelliJ IDEA: https://www.jetbrains.com/idea/.
The web project's TypeScript code requires the (paid) Ultimate version of IntelliJ.

First run Maven to generate the Java code for gRPC:

```
$ cd platform
@@ -355,7 +357,7 @@ add three modules: web/pom.xml, platform/pom.xml, and the root folder hillview i

## 4.5. Setup VS Code

Download and install Visual Studio Code: https://code.visualstudio.com/download.
Here is a step-by-step guide to add the necessary extensions, run Maven commands, and attach a debugger:

1. Install these extensions and then restart VS Code.
@@ -364,24 +366,24 @@ Here is a step-by-step guide to add the necessary extensions, run Maven commands
- `Language Support for Java(TM) by Red Hat
redhat.java`: recognize projects with Maven or Gradle build in the directory hierarchy.
- `Maven for Java`: provides a project explorer and shortcuts to execute Maven commands.
2. Select `Add workspace folder...` on the Welcome page, then choose the `hillview/platform/` directory. The platform module should be displayed in the `Explorer` view.
3. Add the `web` module to the workspace by clicking `File`->`Add Folder to Workspace...` and then choosing the `hillview/web/` directory.
4. Save the workspace by clicking `File`->`Save Workspace As...` and store it in your personal folder outside `hillview/` root directory.
5. To execute Maven commands, click `MAVEN PROJECTS` in the `Explorer` view. There are two Maven folders, corresponding to the `web` and `platform` modules;
click those folders to expand and display the Maven pom files. The Maven commands will be displayed by right clicking the pom files.
6. Finally, about attaching a debugger:
- Bring up the `Run` view, select the `Run` icon in the `Activity Bar` on the left side of VS Code.
- From the `Run` view, click `create a launch.json file`; you will see the `platform` and `web` modules listed. Create two `launch.json` files, one for the `platform` module and the other for the `web` module.
- When configuring the `launch.json` for the `platform` module, select the `Java` option; for the `web` module, choose the `Chrome (preview)` option. Then delete the auto-generated `configurations`
and specify the correct configuration to attach the debugger. The important fields are `url`, `hostname`, `port`, and `request`. More information is available at
[VS Code Debugging#launch-configuration](https://code.visualstudio.com/docs/editor/debugging#_launch-configurations) and [VS Code#Java-Debugging](https://code.visualstudio.com/docs/java/java-debugging#_attach).

## 4.6 Debugging

Debugging on a single machine can be done as follows:
- you can start the back-end service under the debugger,
by starting the HillviewBackend binary with command-line arguments 127.0.0.1:3569
- you can start the front-end service by attaching
to the Java process created by Java Tomcat. The frontend-start.sh
script has a line that sets up the environment variables to enable this.

@@ -426,6 +428,6 @@ Here is a step-by-step guide to submitting contributions:
`@Nullable` annotation (from javax.annotation) for all pointers which
can be null. Use `Converters.checkNull` to cast a @Nullable pointer to a
non-null pointer.

* Some code executes on multiple machines or in multiple threads. In particular,
all classes that derive from `IMap` or `ISketch` should be immutable.
3 changes: 3 additions & 0 deletions bin/README.md
@@ -18,6 +18,9 @@
deploys it, and restarts the service
* `upload-file.sh`: Given a csv file it will guess a schema for it and
upload it to a remote cluster chopped into small pieces.
* `dump-greenplum.sh`: This script is used to connect Hillview
with [Greenplum](https://greenplum.org/) distributed databases.
It should be installed on each Greenplum worker machine.

The following are templates that are used to generate actual shell scripts
on a remote cluster when Hillview is installed
3 changes: 2 additions & 1 deletion bin/delete-data.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This script deletes a specific folder on all the machines in a Hillview cluster."""
 # pylint: disable=invalid-name
54 changes: 54 additions & 0 deletions bin/deploy-greenplum.py
@@ -0,0 +1,54 @@
#!/usr/bin/env python
# This script installs Hillview next to a greenplum database.
# It needs a config-greenplum.json file that has a description of the
# Greenplum database installation. See also the section
# on Greenplum installation from https://github.com/vmware/hillview/blob/master/docs/userManual.md

# Copyright (c) 2020 VMware Inc. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# pylint: disable=invalid-name
from argparse import ArgumentParser
from jproperties import Properties
import os
from hillviewCommon import ClusterConfiguration, get_config, get_logger, execute_command

def main():
    parser = ArgumentParser()
    parser.add_argument("config", help="json cluster configuration file")
    args = parser.parse_args()
    config = get_config(parser, args)

    execute_command("./package-binaries.sh")
    web = config.get_webserver()
    web.copy_file_to_remote("../hillview-bin.zip", ".", "")
    web.copy_file_to_remote("config-greenplum.json", ".", "")
    web.run_remote_shell_command("unzip -o hillview-bin.zip")
    web.run_remote_shell_command("cd bin; ./upload-data.py -d . -s dump-greenplum.sh config-greenplum.json")
    web.run_remote_shell_command("cd bin; ./redeploy.sh -s config-greenplum.json")
    web.copy_file_to_remote("../repository/PROGRESS_DATADIRECT_JDBC_DRIVER_PIVOTAL_GREENPLUM_5.1.4.000275.jar",
                            config.service_folder + "/" + config.tomcat + "/lib", "")
    # Generate properties file
    with open("greenplum.properties", "rb") as f:
        p = Properties()
        p.load(f, "utf-8")
        p["greenplumDumpScript"] = config.service_folder + "/dump-greenplum.sh"
    with open("hillview.properties", "wb") as f:
        p.store(f, encoding="utf-8")
    web.copy_file_to_remote("hillview.properties", config.service_folder, "")
    os.remove("hillview.properties")

if __name__ == "__main__":
    main()
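
The "Generate properties file" step above reads the blueprint, overrides one key, and writes the result. The sketch below shows the same idea using only the standard library, for readers who want to see the transformation without `jproperties`. It is a simplified assumption-laden version: `override_property` is a hypothetical helper, it only understands `key = value` lines and `#` comments, and `service_folder` is a made-up path.

```python
def override_property(lines, key, value):
    """Return a copy of `lines` in which `key` is set to `value`.

    Comment lines and blank lines pass through unchanged. If the key is
    absent, a new `key = value` line is appended.
    """
    result = []
    found = False
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            name = stripped.split("=", 1)[0].strip()
            if name == key:
                result.append("{} = {}".format(key, value))
                found = True
                continue
        result.append(line)
    if not found:
        result.append("{} = {}".format(key, value))
    return result


blueprint = [
    "# This script is invoked when data is dumped from an external web table",
    "greenplumDumpScript = /path/from/blueprint/dump-greenplum.sh",
    "greenplumDumpDirectory = /tmp",
]
service_folder = "/home/hillview"  # hypothetical value of config.service_folder
updated = override_property(blueprint, "greenplumDumpScript",
                            service_folder + "/dump-greenplum.sh")
print("\n".join(updated))
```

The Java properties format also allows `:` separators, escapes, and line continuations, none of which this sketch handles; that is one reason to prefer a dedicated library in the real script.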
3 changes: 2 additions & 1 deletion bin/deploy.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This python program deploys the files needed by the Hillview service
 on the machines specified in the configuration file."""
3 changes: 2 additions & 1 deletion bin/download-data.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This script takes a cluster configuration and a file pattern.
 It downloads the files that match from all machines in the cluster."""
28 changes: 28 additions & 0 deletions bin/dump-greenplum.sh
@@ -0,0 +1,28 @@
#!/bin/bash
#
# Copyright (c) 2020 VMware Inc. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script is used when connecting to a Greenplum database
# to dump data in an external web table. See
# https://gpdb.docs.pivotal.io/6-10/admin_guide/load/topics/g-defining-a-command-based-writable-external-web-table.html
# and https://gpdb.docs.pivotal.io/6-10/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
# The script receives data at stdin

# Single argument is the directory where the data is to be dumped
DIR=$1
PREFIX="file"
mkdir -p ${DIR} || exit 1
echo "$(</dev/stdin)" >${DIR}/${PREFIX}${GP_SEGMENT_ID}
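
For comparison, here is a minimal Python sketch of the same dump step, assuming Python 3. It is not part of the commit. The shell version reads all of stdin into a command substitution, which holds the whole input in memory and strips trailing newlines before `echo` adds one back; the sketch streams the bytes instead. The function name `dump_stream` and its parameters are hypothetical.

```python
import io
import os
import shutil
import tempfile


def dump_stream(stream, directory, segment_id, prefix="file"):
    """Copy a binary stream to <directory>/<prefix><segment_id> and return the path."""
    os.makedirs(directory, exist_ok=True)  # same role as `mkdir -p`
    path = os.path.join(directory, prefix + str(segment_id))
    with open(path, "wb") as out:
        # Copies in chunks, so the input never has to fit in memory.
        shutil.copyfileobj(stream, out)
    return path


# Demonstration with an in-memory stream standing in for stdin.
demo_dir = tempfile.mkdtemp()
demo_path = dump_stream(io.BytesIO(b"a,b\n1,2\n"), demo_dir, 0)
print(demo_path)
```

In a real script the call would presumably look like `dump_stream(sys.stdin.buffer, sys.argv[1], os.environ["GP_SEGMENT_ID"])`, mirroring the `DIR` argument and the `GP_SEGMENT_ID` variable used by the shell script.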
11 changes: 11 additions & 0 deletions bin/greenplum.properties
@@ -0,0 +1,11 @@
# This properties file is a blueprint for a hillview.properties file
# used with a Greenplum installation.

###########################################################
# Parameters interfacing Hillview with a Greenplum database

# This script is invoked when data is dumped from an external web table
greenplumDumpScript = /home/gpadmin/hillview/dump-greenplum.sh
# This directory is used to store the data dumped from Greenplum before it's parsed by Hillview.
# The directory must be writable by the segment hosts.
greenplumDumpDirectory = /tmp
11 changes: 9 additions & 2 deletions bin/hillviewCommon.py
@@ -1,15 +1,20 @@
"""Common functions used by the Hillview deployment scripts"""

# pylint: disable=invalid-name,too-few-public-methods, bare-except
from __future__ import print_function
import os.path
import os
import subprocess
import tempfile
import json
import getpass
import logging
import sys
from argparse import ArgumentParser

is3 = sys.version_info[0] == 3
print("Python version is", 3 if is3 else 2)

def get_logger(module_name):
    """ Returns the logger object """
    logger = logging.getLogger(module_name)
@@ -37,8 +42,10 @@ class RemoteHost:
     """Abstraction for a remote host"""
     def __init__(self, user, host, parent, heapsize="200M"):
         """Create a remote host"""
-        assert isinstance(user, str)
-        assert isinstance(host, str)
+        global is3
+        if is3:
+            assert isinstance(user, str)
+            assert isinstance(host, str)
         assert parent is None or isinstance(parent, RemoteHost)
         self.host = host
         self.user = user
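
The change above simply skips the string checks under Python 2. The commit does not say why; one plausible reason (an assumption) is that Python 2 strings read from JSON are `unicode` objects, for which `isinstance(x, str)` is false. An alternative that keeps the check on both versions is sketched below; `STRING_TYPES` and `check_text` are hypothetical names, not part of `hillviewCommon.py`.

```python
import sys

IS_PY3 = sys.version_info[0] == 3

if IS_PY3:
    STRING_TYPES = (str,)
else:
    # Python 2 only: basestring covers both str and unicode.
    STRING_TYPES = (basestring,)  # noqa: F821


def check_text(value, name):
    """Return `value` if it is a text string on this Python version."""
    if not isinstance(value, STRING_TYPES):
        raise TypeError("{} must be a string, got {}".format(
            name, type(value).__name__))
    return value


user = check_text("gpadmin", "user")
print(user)
```

Raising `TypeError` instead of using `assert` also keeps the check active when Python runs with `-O`, which removes assertions.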
5 changes: 3 additions & 2 deletions bin/install-dependencies.sh
@@ -32,6 +32,7 @@ esac
${SUDO} ${INSTALL} install wget maven ${NODEJS} ${NPM} ${LIBFORTRAN} unzip gzip python3
echo "Installing typescript compiler"
${SUDO} npm install -g [email protected]
pip install jproperties

# Download apache if not there.
pushd ..
@@ -75,6 +76,6 @@ popd

# Install Cassandra and populate a test database
if [ ${INSTALL_CASSANDRA} -eq 1 ]; then
./${mydir}/install-cassandra.sh
sudo apt install mysql-server
fi
13 changes: 11 additions & 2 deletions bin/package-binaries.sh
@@ -4,12 +4,21 @@
# Should be run after the binaries have been built.
# This archive has to be unpacked in the toplevel Hillview folder.

set -e
set -ex

ARCHIVE=hillview-bin.zip
#TARARCHIVE=hillview.tar.gz

#echo "Creating ${ARCHIVE} and ${TARARCHIVE} in toplevel directory."
echo "Creating ${ARCHIVE} in toplevel directory."
cd ..

FILES="platform/target/hillview-server-jar-with-dependencies.jar web/target/web-1.0-SNAPSHOT.war platform/target/DataUpload-jar-with-dependencies.jar bin/*.py bin/*.sh bin/*.bat bin/config.json bin/config-local.json"

rm -f ${ARCHIVE}
zip ${ARCHIVE} platform/target/hillview-server-jar-with-dependencies.jar web/target/web-1.0-SNAPSHOT.war platform/target/DataUpload-jar-with-dependencies.jar bin/*.py bin/*.sh bin/*.bat bin/config.json bin/config-local.json
zip ${ARCHIVE} ${FILES}

#rm -f ${TARARCHIVE}
#tar cvfz ${TARARCHIVE} ${FILES}

cd bin
4 changes: 2 additions & 2 deletions bin/rebuild.sh
@@ -44,8 +44,8 @@ if [ "x${TOOLSARGS}" != "x" ]; then
 fi
 export MAVEN_OPTS="-Xmx2048M"
 pushd ${mydir}/../platform
-mvn ${TOOLSARGS} ${TESTARGS} clean install
+mvn ${TOOLSARGS} ${TESTARGS} install
 popd
 pushd ${mydir}/../web
-mvn ${TESTARGS} clean package
+mvn ${TESTARGS} package
 popd
3 changes: 2 additions & 1 deletion bin/run-on-all.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3
 # -*-python-*-

 """This script runs a command on all worker hosts of a Hillview cluster."""
3 changes: 2 additions & 1 deletion bin/start.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This Python program starts the Hillview service on the machines
 specified in the configuration file."""
3 changes: 2 additions & 1 deletion bin/status.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This program checks if the Hillview service is running on all machines
 specified in the configuration file."""
3 changes: 2 additions & 1 deletion bin/stop.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This Python program stops the Hillview service on the machines specified in the
 configuration file."""
5 changes: 3 additions & 2 deletions bin/upload-data.py
@@ -1,4 +1,5 @@
-#!/usr/bin/env python3
+#!/usr/bin/env python
+# We attempted to make this program work with both python2 and python3

 """This script takes a set of files and a cluster configuration describing a set of machines.
 It uploads the files to the given machines in round-robin fashion.
@@ -78,7 +79,7 @@ def main():
     if args.files:
         copy_files(config, folder, args.files, copyOptions)
     else:
-        logger.error("No files to upload to the machines provided in a Hillview configuration")
+        logger.info("No files to upload to the machines provided in a Hillview configuration")
     logger.info("Done.")

 if __name__ == "__main__":
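
The docstring of `upload-data.py` says files are uploaded "in round-robin fashion". The body of `copy_files` is not part of this diff, so the following is only a sketch of what round-robin assignment means; `assign_round_robin` and the host and file names are hypothetical.

```python
from itertools import cycle


def assign_round_robin(files, hosts):
    """Map each host to the files it receives, dealing files out in turn."""
    if not hosts:
        raise ValueError("at least one host is required")
    plan = {host: [] for host in hosts}
    # cycle() repeats the host list, so file i goes to host i % len(hosts).
    for name, host in zip(files, cycle(hosts)):
        plan[host].append(name)
    return plan


plan = assign_round_robin(["a.csv", "b.csv", "c.csv"], ["worker1", "worker2"])
print(plan)
```

With three files and two hosts, the first host receives the first and third files and the second host receives the second, so the load differs by at most one file between hosts.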