Add Chronos job to do HDFS -> ElasticSearch monthly
This adds a Chronos job that reindexes the TargetSmart data once a
month. To fit in 1 GB of memory, I reduced the bulk index batch size for
voters from 1,000,000 to 100,000.

Change-Id: Ic6cec051930f6e2cce8b5e7929aebe9df2ffb8d1
Reviewed-on: https://code.brigade.com/6775
Tested-by: Leeroy Jenkins <[email protected]>
Reviewed-by: John Miller <[email protected]>
Reviewed-by: Shane da Silva <[email protected]>
tdooner committed Oct 22, 2015
1 parent 6f44ad7 commit 39bc691
Showing 4 changed files with 43 additions and 11 deletions.
2 changes: 2 additions & 0 deletions Dockerfile
@@ -8,6 +8,8 @@ RUN touch /var/lib/rpm/{,*} \
RUN useradd app \
&& wget -O /usr/bin/gosu https://github.com/tianon/gosu/releases/download/1.4/gosu-amd64 \
&& chmod +x /usr/bin/gosu
***REMOVED***
&& chmod +x /usr/bin/pv
RUN easy_install virtualenv
ADD entrypoint.sh /usr/local/bin/entrypoint
ENTRYPOINT ["/usr/local/bin/entrypoint"]
36 changes: 36 additions & 0 deletions deploy.yml
@@ -51,6 +51,42 @@ default: &default
uris:
- '<%= $deploy_variables[:nexus_url] %>'

chronos:
verifier_reindex_all:
# the "name" and "environmentVariables" keys are added at deploy-time
epsilon: PT60S
executor: ''
executorFlags: ''
retries: 2
***REMOVED***
ownerName: ''
async: false
cpus: 1.0
disk: 256.0
mem: 1024.0
softError: false
dataProcessingJobType: false
uris:
- '<%= $deploy_variables[:nexus_url] %>'
highPriority: false
runAsUser: root
# TODO: This command should be baked into the Docker image (as its
# entrypoint), however a Chronos bug prevents us from using `shell: false`
# and simply passing `arguments`. https://github.com/mesos/chronos/issues/567
command: '/usr/local/bin/entrypoint'
container:
type: docker
image: '<%= $deploy_variables[:docker_image] %>'
network: BRIDGE
scheduleTimeZone: UTC
# TODO: Pass these arguments into the ./bin/run script. This is blocked
# on a Chronos bug which adds the value of the arguments array to the
# task ID which fails because slashes are invalid in task IDs.
# https://github.com/mesos/chronos/issues/568
arguments: ['bash', 'index_all.sh']
***REMOVED***
schedule: R/2015-06-20T00:00:00.000Z/P1M # every month

###############################################################################
# EDGE
###############################################################################
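The `schedule` value in the deploy.yml hunk above, `R/2015-06-20T00:00:00.000Z/P1M`, is an ISO 8601 repeating interval: `R` with no count means "repeat indefinitely", followed by a start instant and a one-month period. As a rough illustration of which instants that describes, here is a minimal Python sketch. The helper names `add_months` and `monthly_runs` are hypothetical, and the month-end clamping is an assumption of this sketch; it does not model how Chronos itself handles missed runs, `epsilon`, or short months.

```python
import calendar
from datetime import datetime, timezone
from typing import List


def add_months(start: datetime, months: int) -> datetime:
    """Return `start` shifted forward by `months` calendar months.

    The day is clamped to the last day of the target month (an assumption
    of this sketch, e.g. Jan 31 + 1 month -> Feb 28 in a non-leap year).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return start.replace(year=year, month=month, day=day)


def monthly_runs(start: datetime, count: int) -> List[datetime]:
    """First `count` instants of an R/<start>/P1M repeating interval."""
    return [add_months(start, i) for i in range(count)]


# Start instant taken from the schedule string in deploy.yml.
start = datetime(2015, 6, 20, tzinfo=timezone.utc)
for run in monthly_runs(start, 3):
    print(run.isoformat())
```

Because the start falls on the 20th, the clamping branch never triggers for this particular job; it is only there so the helper behaves sensibly for other start dates.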
2 changes: 1 addition & 1 deletion index.py
@@ -129,7 +129,7 @@ def index_records(index_name, voters):

***REMOVED***

-if len(voters) >= 1000000:
+if len(voters) >= 100000:
index_records(index, voters)
voters = []
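The hunk above lowers the threshold at which buffered voters are handed to `index_records`, which caps how many records sit in memory at once. A minimal sketch of that buffer-and-flush pattern follows. The function `index_in_batches` and its `flush` callback are hypothetical stand-ins (the real code calls `index_records(index, voters)`), and the final partial-batch flush is an assumption, since the rest of index.py is not visible in this diff.

```python
from typing import Callable, Iterable, List


def index_in_batches(
    records: Iterable[str],
    flush: Callable[[List[str]], None],
    batch_size: int = 100_000,
) -> int:
    """Buffer records and pass them to `flush` once `batch_size` is reached.

    Returns the total number of records flushed. At most `batch_size`
    records are held in memory at any time.
    """
    batch: List[str] = []
    total = 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush(batch)
            total += len(batch)
            batch = []  # rebind instead of clearing, in case flush kept a reference
    if batch:  # assumed: flush whatever is left after the input ends
        flush(batch)
        total += len(batch)
    return total
```

Peak memory scales roughly with `batch_size` times the size of a record, which is why a tenfold smaller batch helps the job fit in the 1 GB the Chronos task requests, at the cost of ten times as many bulk requests.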

14 changes: 4 additions & 10 deletions index_all.sh
@@ -1,19 +1,13 @@
#!/usr/bin/env bash
-set -e
+set -exuo pipefail

-echo "Finding total number of records..." >&2
-# Cached value from running the below command:
-total=228459351
-***REMOVED***
-# --user=brigade_media --password=$TARGETSMART_PASSWORD -O - | awk '{ sum+=$1} END {print sum}')
-echo "... ${total} records" >&2
-
-for FILE in $(python list_files.py); do
+total=230000000 # <- approximately correct value:
+for FILE in $(env/bin/python list_files.py); do
echo "Processing file $(basename $FILE)..." >&2
wget \
--timeout 900 \
--output-document - \
--quiet \
$FILE \
| gunzip
-done | pv -l -s $total | python index.py $total
+done | pv -l -s $total | env/bin/python index.py $total
