Hue error with MySQL 8:
Error accessing DB: (2059, "Authentication plugin 'caching_sha2_password' cannot be loaded: /usr/lib64/mysql/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory")
To fix:
ALTER USER hue IDENTIFIED WITH mysql_native_password BY 'password';
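A quick way to confirm the change (not from the original post) is to connect as the hue account using the native password plugin. In this sketch the host name, schema name and password are placeholders for your own Hue database settings:

# Minimal connectivity check from the Hue host after the ALTER USER change.
# Assumptions: mysql-connector-python is installed; host, schema and password are placeholders.
import mysql.connector

conn = mysql.connector.connect(host='cdh-vm.dbaglobe.com',   # replace with your MySQL host
                               user='hue',
                               password='password',          # the password set in ALTER USER
                               database='hue',
                               auth_plugin='mysql_native_password')
cur = conn.cursor()
cur.execute("SELECT VERSION()")
print(cur.fetchone())    # connection succeeds once the plugin mismatch is resolved
conn.close()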
Tuesday, February 12, 2019
Saturday, February 2, 2019
Display IP address of machine on the banner without login
Add the following lines to /etc/rc.local:
# initial setup, cp /etc/issue /etc/issue.orig
cp /etc/issue.orig /etc/issue
ip addr | grep inet | grep -i enp | awk '{print $NF, $2}' | awk -F'/' '{print $1}' >> /etc/issue
echo "" >> /etc/issue
Monday, December 31, 2018
Demo code to showcase Solr ingestion
# Written for Solr 7 (shipped with CDH6) and Python3
import tika
import json
import urllib3
import traceback
import os

tika.initVM()
from tika import parser

url = 'http://node02.dbaglobe.com:8983/solr/cms/update/json/docs?commit=true'

filelist = ['D:\\Temp\\Building Positive Relationships young children.pdf',
            'D:\\Temp\\Building Positive Relationships spouse n in laws.pdf']

http = urllib3.PoolManager()

# To clean up, execute the Solr delete query: *:*
# Use configSetProp.immutable=false to avoid the error "This ConfigSet is immutable.";
# use the call below to create the template before creating the collection:
# http://node02:8983/solr/admin/configs?action=CREATE&name=myConfigSet&baseConfigSet=schemalessTemplate&configSetProp.immutable=false&wt=xml
# To search: content:"Psychologist"

for file in filelist:
    try:
        parsed = parser.from_file(file)

        # Add content to "combined" dict object
        combined = {}
        combined['id'] = os.path.basename(file)  # use file name as Doc ID
        combined.update(parsed["metadata"])
        combined['content'] = parsed["content"]

        combined_json = json.loads(json.dumps(combined))
        print(combined_json)

        response = http.request('POST', url,
                                body=json.dumps(combined_json),
                                headers={'Content-Type': 'application/json'})
        print(response.data)
    except:
        print(traceback.format_exc())
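As a follow-up, the comments above mention searching with content:"Psychologist". Below is a minimal query sketch against the same node and collection; only the field and term come from that comment, and the use of the standard /select handler is an assumption:

# Minimal query sketch against the 'cms' collection populated above.
import json
import urllib3

http = urllib3.PoolManager()
query_url = 'http://node02.dbaglobe.com:8983/solr/cms/select'
resp = http.request('GET', query_url,
                    fields={'q': 'content:"Psychologist"', 'wt': 'json'})
results = json.loads(resp.data.decode('utf-8'))
for doc in results['response']['docs']:
    print(doc['id'])      # file names were used as document IDs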
Sunday, December 16, 2018
Sentry and Hive permission explained
When using Sentry, the impersonation feature of HiveServer2 is disabled and each query runs in the cluster as the configured Hive principal. Thus, each HDFS location associated with a Hive table should be readable and writable by the Hive user or group.
If you are using the HDFS ACL synchronization feature, the required HDFS permissions (r-x for SELECT, -wx for INSERT, and rwx for ALL) on files are enforced automatically and maintained dynamically in response to changes in privilege grants on databases and tables. In our example, the alice user would be given r-x permission to files in tables in the sales database. Note that a grant on a URI object does not result in corresponding permissions on the location in HDFS.
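To make the behaviour concrete, here is a hypothetical sketch (not from the original post) that issues a grant through HiveServer2 and then inspects the ACLs that the HDFS sync applies. The host, role name and warehouse path are assumptions that will differ per cluster:

# Hypothetical illustration of a Sentry grant and the resulting HDFS ACLs.
# Assumptions: PyHive installed, HiveServer2 on node01:10000, role 'sales_read'
# already exists and is assigned to alice's group, default warehouse path.
import subprocess
from pyhive import hive

conn = hive.connect(host='node01.dbaglobe.com', port=10000, username='hive')
cur = conn.cursor()
cur.execute("GRANT SELECT ON DATABASE sales TO ROLE sales_read")

# With HDFS ACL sync enabled, alice's group should now show r-x on the table files
subprocess.run(["hdfs", "dfs", "-getfacl", "-R", "/user/hive/warehouse/sales.db"])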
Saturday, October 13, 2018
Tweet Read Example with PySpark streaming analytics
===================TweetsRead.py===================
import tweepy
from tweepy import Stream
from tweepy.auth import OAuthHandler
from tweepy.streaming import StreamListener
import socket
import json

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''


class TweetsListener(StreamListener):
    def __init__(self, csocket):
        self.client_socket = csocket

    def on_data(self, data):
        try:
            msg = json.loads(data)
            print(msg['text'].encode('utf-8'))
            self.client_socket.send(msg['text'].encode('utf-8'))
            return True
        except BaseException as e:
            print("Error on data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True


def sendData(c_socket):
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    twitter_stream = Stream(auth, TweetsListener(c_socket))
    twitter_stream.filter(track=['china'])


if __name__ == "__main__":
    s = socket.socket()          # create socket object
    host = "127.0.0.1"
    port = 5555
    s.bind((host, port))
    print("Listening on port: %s" % str(port))
    s.listen(5)
    c, addr = s.accept()
    print("Received request from: " + str(addr))
    sendData(c)
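Before attaching Spark, the listener can be smoke-tested with a bare socket client. The sketch below (an illustration, not part of the original scripts) connects to the same host and port and prints whatever tweet text arrives:

# Quick smoke test for TweetsRead.py: run TweetsRead.py first, then this client.
import socket

client = socket.socket()
client.connect(("127.0.0.1", 5555))
try:
    while True:
        data = client.recv(4096)
        if not data:
            break
        print(data.decode("utf-8", errors="replace"))
finally:
    client.close()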
===================TwitterAnalytics.py===================
import findspark
findspark.init('/Users/donghua/spark-2.3.2-bin-hadoop2.7')

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc

# Not required if running in Pyspark integrated notebook
sc = SparkContext()
ssc = StreamingContext(sc, 10)
sqlContext = SQLContext(sc)

socket_stream = ssc.socketTextStream("127.0.0.1", 5555)
lines = socket_stream.window(20)

from collections import namedtuple
fields = ("tag", "count")
Tweet = namedtuple('Tweet', fields)

# use () for multiple lines
(lines.flatMap(lambda text: text.split(" "))
      .filter(lambda word: word.startswith("#"))
      .map(lambda word: (word.lower(), 1))
      .reduceByKey(lambda a, b: a + b)
      .map(lambda rec: Tweet(rec[0], rec[1]))
      .foreachRDD(lambda rdd: rdd.toDF().sort(desc("count"))
                                 .limit(10).registerTempTable("tweets")))

import time
from IPython import display
import matplotlib.pyplot as plt
import seaborn as sns
import pandas
# get_ipython().run_line_magic('matplotlib', 'inline')

ssc.start()

count = 0
while count < 10:
    time.sleep(10)
    top_10_tweets = sqlContext.sql("select tag, count from tweets")
    top_10_df = top_10_tweets.toPandas()
    display.clear_output(wait=True)
    plt.figure(figsize=(10, 8))
    sns.barplot(x="count", y="tag", data=top_10_df)
    plt.show()
    count = count + 1

ssc.stop()
Wednesday, October 10, 2018
Use Jupyter Notebook with Spark2 on Apache Spark on MacOS
Reference URL: http://www.dbaglobe.com/2018/07/use-jupyter-notebook-with-spark2-on.html
Below are the changes specific to Apache Spark and Anaconda3 on MacOS:
mkdir /Users/donghua/anaconda3/share/jupyter/kernels/pyspark2/
Donghuas-MacBook-Air:bin donghua$ cat /Users/donghua/anaconda3/share/jupyter/kernels/pyspark2/kernel.json
{
"argv": [
"python3.6",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "Python3.6 + Pyspark(Spark 2.3.2)",
"language": "python",
"env": {
"PYSPARK_PYTHON": "python",
"SPARK_HOME": "/Users/donghua/spark-2.3.2-bin-hadoop2.7",
"SPARK_CONF_DIR": "/Users/donghua/spark-2.3.2-bin-hadoop2.7/conf",
"PYTHONPATH": "/Users/donghua/spark-2.3.2-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/Users/donghua/spark-2.3.2-bin-hadoop2.7/python/:",
"PYTHONSTARTUP": "/Users/donghua/spark-2.3.2-bin-hadoop2.7/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "--master spark://Donghuas-MacBook-Air.local:7077 --name PySparkShell pyspark-shell"
}
}
/Users/donghua/anaconda3/bin/jupyter-notebook --ip=Donghuas-MacBook-Air.local --port 9999
#export PYSPARK_DRIVER_PYTHON=ipython
#export SPARK_HOME=/Users/donghua/spark-2.3.2-bin-hadoop2.7
#export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
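Once a notebook is started with this kernel, PYTHONSTARTUP runs pyspark/shell.py, so the spark and sc variables should already exist in the first cell. A quick sanity check (assuming the Spark 2.3 shell.py behaviour) could be:

# Run in a notebook cell using the "Python3.6 + Pyspark(Spark 2.3.2)" kernel.
# shell.py (via PYTHONSTARTUP) should have created `spark` and `sc` already.
print(spark.version)     # expect 2.3.2
print(sc.master)         # expect spark://Donghuas-MacBook-Air.local:7077
rdd = sc.parallelize(range(100))
print(rdd.sum())         # 4950 if the standalone workers are reachable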
Tuesday, October 9, 2018
Modify default slaves.sh to start Apache Spark on MacOS
By default, Apache Spark's sbin/start-all.sh tries to start each worker over SSH to the slave nodes, even when everything runs on a single laptop. The modification below starts the workers with Bash instead of SSH.
Donghuas-MacBook-Air:sbin donghua$ diff slaves.sh slaves.sh.old
92,94c92,93
< cmd="${@// /\\ } 2>&1"
< echo $cmd
< bash -c "$cmd"
---
> ssh $SPARK_SSH_OPTS "$slave" $"${@// /\\ }" \
> 2>&1 | sed "s/^/$slave: /"
96,98c95,96
< cmd="${@// /\\ } 2>&1"
< echo $cmd
< bash -c "$cmd"
---
> ssh $SPARK_SSH_OPTS "$slave" $"${@// /\\ }" \
> 2>&1 | sed "s/^/$slave: /" &
The revised code can be found here:
https://github.com/luodonghua/bigdata/blob/master/slaves.sh
A modified version of spark-config.sh to start multiple workers can be found here:
https://github.com/luodonghua/bigdata/blob/master/spark-config.sh
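After running sbin/start-all.sh with the modified scripts, the standalone master's JSON endpoint (assumed here to be on the default web UI port 8080) can confirm that the local workers registered:

# Check that the locally started workers registered with the standalone master.
# Assumes the default master web UI port 8080 on this machine.
import json
import urllib.request

with urllib.request.urlopen("http://Donghuas-MacBook-Air.local:8080/json/") as resp:
    status = json.loads(resp.read().decode("utf-8"))

print("Master state:", status["status"])
for worker in status["workers"]:
    print(worker["id"], worker["state"], "cores:", worker["cores"])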