Showing posts with label Cloudera. Show all posts

Sunday, March 24, 2019

Security Master Page

TLS, SSL, HTTPS
  • Diagnosing TLS, SSL, and HTTPS
Kerberos
  • Hadoop and Kerberos: The Madness beyond the Gate
  • Configuring Ambari and Hadoop for Kerberos using AD as the KDC

Kafka reports 0 available brokers on one cluster, although 1 broker is up and running

ERROR admin.TopicCommand$: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 1 larger than available brokers: 0.
[donghua@cdh5 bin]$ ./kafka-topics --create --zookeeper cdh5:2181 --topic weblogs --replication-factor 1 --partitions 2

19/03/24 15:39:42 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/cloudera/parcels/KAFKA-3.1.1-1.3.1.1.p0.2/bin
19/03/24 15:39:42 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cdh5:2181 sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@498d318c
19/03/24 15:39:42 INFO zkclient.ZkClient: Waiting for keeper state SyncConnected
19/03/24 15:39:42 INFO zookeeper.ClientCnxn: Opening socket connection to server cdh5.dbaglobe.com/192.168.31.25:2181. Will not attempt to authenticate using SASL (unknown error)
19/03/24 15:39:42 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.31.25:43992, server: cdh5.dbaglobe.com/192.168.31.25:2181
19/03/24 15:39:42 INFO zookeeper.ClientCnxn: Session establishment complete on server cdh5.dbaglobe.com/192.168.31.25:2181, sessionid = 0x169ae9e3aff0021, negotiated timeout = 30000
19/03/24 15:39:42 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
Error while executing topic command : Replication factor: 1 larger than available brokers: 0.
19/03/24 15:39:42 ERROR admin.TopicCommand$: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 1 larger than available brokers: 0.
19/03/24 15:39:42 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
19/03/24 15:39:42 INFO zookeeper.ZooKeeper: Session: 0x169ae9e3aff0021 closed
19/03/24 15:39:42 INFO zookeeper.ClientCnxn: EventThread shut down

Troubleshooting:

Check the broker's ZooKeeper path in the broker log; it registers under /brokers/:
[root@cdh5 log]# tail -f /var/log/kafka/kafka-broker-cdh5.dbaglobe.com.log 
2019-03-24 15:35:46,490 INFO kafka.utils.ZKCheckedEphemeral: Creating /brokers/ids/44 (is it secure? false)
2019-03-24 15:35:46,498 INFO kafka.utils.ZKCheckedEphemeral: Result of znode creation is: OK
2019-03-24 15:35:46,499 INFO kafka.utils.ZkUtils: Registered broker 44 at path /brokers/ids/44 with addresses: EndPoint(cdh5.dbaglobe.com,9092,ListenerName(PLAINTEXT),PLAINTEXT)
Check the zookeeper.chroot configuration, which is "/kafka". Change it to "/" to fix the problem.
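
The mismatch can be sketched as follows. With a chroot of /kafka, the broker's registration ends up under /kafka/brokers/ids, while a client that connects to plain cdh5:2181 looks under /brokers/ids and finds no brokers. The helper names below are hypothetical and only build paths and connect strings; nothing here talks to ZooKeeper.

```python
def normalize_chroot(chroot):
    """Return the chroot with exactly one leading slash and no trailing slash."""
    return "/" + chroot.strip("/")


def broker_registry_path(chroot="/"):
    """Absolute znode path under which brokers register for a given chroot."""
    chroot = normalize_chroot(chroot)
    return ("" if chroot == "/" else chroot) + "/brokers/ids"


def zk_connect_string(hosts, chroot="/"):
    """ZooKeeper connect string with the chroot appended (host:port/chroot)."""
    chroot = normalize_chroot(chroot)
    return hosts if chroot == "/" else hosts + chroot


print(broker_registry_path("/kafka"))            # where the broker registered
print(broker_registry_path("/"))                 # where the client looked
print(zk_connect_string("cdh5:2181", "/kafka"))  # chroot-qualified connect string
```

Passing the chroot-qualified string (cdh5:2181/kafka) to --zookeeper should be an alternative to changing the chroot, although that was not tested here.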

[donghua@cdh5 bin]$ ./kafka-topics --create --zookeeper cdh5:2181 --topic weblogs --replication-factor 1 --partitions 2
19/03/24 15:49:53 INFO zkclient.ZkEventThread: Starting ZkClient event thread.
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-cdh5.14.2--1, built on 03/27/2018 20:39 GMT
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:host.name=cdh5.dbaglobe.com

19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:os.version=3.10.0-957.5.1.el7.x86_64
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:user.name=donghua
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/donghua
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/cloudera/parcels/KAFKA-3.1.1-1.3.1.1.p0.2/bin
19/03/24 15:49:53 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cdh5:2181 sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@498d318c
19/03/24 15:49:53 INFO zkclient.ZkClient: Waiting for keeper state SyncConnected
19/03/24 15:49:53 INFO zookeeper.ClientCnxn: Opening socket connection to server cdh5.dbaglobe.com/192.168.31.25:2181. Will not attempt to authenticate using SASL (unknown error)
19/03/24 15:49:53 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.31.25:45328, server: cdh5.dbaglobe.com/192.168.31.25:2181
19/03/24 15:49:53 INFO zookeeper.ClientCnxn: Session establishment complete on server cdh5.dbaglobe.com/192.168.31.25:2181, sessionid = 0x169ae9e3aff003a, negotiated timeout = 30000
19/03/24 15:49:53 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
19/03/24 15:49:54 INFO admin.AdminUtils$: Topic creation {"version":1,"partitions":{"1":[44],"0":[44]}}
Created topic "weblogs".
19/03/24 15:49:54 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
19/03/24 15:49:54 INFO zookeeper.ZooKeeper: Session: 0x169ae9e3aff003a closed
19/03/24 15:49:54 INFO zookeeper.ClientCnxn: EventThread shut down

Monday, March 18, 2019

RDD lambda with tuple parameter unpacking raises a syntax error in Spark 2.4 on Python 3

Spark 1.6

[donghua@cdh5 ~]$ pyspark
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkContext available as sc, HiveContext available as sqlContext.
>>> rdd1 = sc.textFile('file:///tmp/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
[(u'00210', (u'43.00589', u'-71.01320')), (u'01014', (u'42.17073', u'-72.60484'))]
>>> 
[donghua@cdh5 ~]$ 

Spark 2.3
[donghua@cdh5 ~]$ pyspark2
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/18 09:55:51 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
19/03/18 09:55:52 WARN lineage.LineageWriter: Lineage directory /var/log/spark2/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.0.cloudera4
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> rdd1 = sc.textFile('file:///tmp/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
[(u'00210', (u'43.00589', u'-71.01320')), (u'01014', (u'42.17073', u'-72.60484'))]
>>> 

Spark 2.4

Donghuas-MacBook-Air:data donghua$ cd /Users/donghua/spark-2.4.0-bin-hadoop2.7;/Users/donghua/spark-2.4.0-bin-hadoop2.7/bin/pyspark --master spark://Donghuas-MacBook-Air.local:7077
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-03-18 09:57:59 WARN  Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Python version 3.6.8 (default, Dec 29 2018 19:04:46)
SparkSession available as 'spark'.
>>> rdd1 = sc.textFile('file:///Users/donghua/spark-2.4.0-bin-hadoop2.7/data/data/postal.txt')
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
  File "<stdin>", line 1
    rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda (k,v): (k, (v.split('\t')[1],v.split('\t')[2]))).take(2)
                                                            ^
SyntaxError: invalid syntax
>>> rdd1.keyBy(lambda line: line.split('\t')[0]).map(lambda v: (v[0], (v[1].split('\t')[1],v[1].split('\t')[2]))).take(2)
2019-03-18 09:59:23 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[('00210', ('43.00589', '-71.01320')), ('01014', ('42.17073', '-72.60484'))]    
>>> 
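
The syntax error comes from the Python 3 interpreter rather than from Spark 2.4 itself: the 1.6 and 2.3 sessions above run on Python 2.7.5, where tuple parameters such as lambda (k,v): ... are still allowed. Besides indexing into the tuple, the pair can be unpacked inside a named function. A minimal sketch in plain Python, with no Spark needed; the two sample lines are reconstructed from the output above, and to_lat_lon is a hypothetical name.

```python
def to_lat_lon(pair):
    """Turn a (key, line) pair into (key, (lat, lon))."""
    key, line = pair  # unpacking in the body is still valid in Python 3
    fields = line.split('\t')
    return key, (fields[1], fields[2])


# Sample lines reconstructed from the output shown above (assumption).
lines = ['00210\t43.00589\t-71.01320', '01014\t42.17073\t-72.60484']

# Equivalent of rdd1.keyBy(lambda line: line.split('\t')[0])
keyed = [(line.split('\t')[0], line) for line in lines]

result = list(map(to_lat_lon, keyed))
print(result)
```

With an RDD, the same function could be passed as rdd1.keyBy(...).map(to_lat_lon).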

Reference: 
PEP 3113 -- Removal of Tuple Parameter Unpacking
https://www.python.org/dev/peps/pep-3113/


Tuesday, February 12, 2019

Error accessing DB: (2059, "Authentication plugin 'caching_sha2_password' cannot be loaded: /usr/lib64/mysql/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory")

Hue error with MySQL 8:

Error accessing DB: (2059, "Authentication plugin 'caching_sha2_password' cannot be loaded: /usr/lib64/mysql/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory")

To fix: 

ALTER USER hue IDENTIFIED WITH mysql_native_password BY 'password';

Monday, December 31, 2018

Demo code to showcase Solr ingestion

# Written for Solr 7 (shipped with CDH6) and Python3

import tika
import json
import urllib3
import traceback
import os


tika.initVM()
from tika import parser
url = 'http://node02.dbaglobe.com:8983/solr/cms/update/json/docs?commit=true'
filelist = ['D:\\Temp\\Building Positive Relationships young children.pdf',
            'D:\\Temp\\Building Positive Relationships spouse n in laws.pdf']

http = urllib3.PoolManager()

for file in filelist:
    try:
        parsed = parser.from_file(file)
        # Add content to the "combined" dict object
        combined = {}
        combined['id'] = os.path.basename(file)  # use file name as Doc ID
        combined.update(parsed["metadata"])
        combined['content'] = parsed["content"]
        combined_json = json.loads(json.dumps(combined))

        print(combined_json)

        # to clean up, execute solr command *:*
        # set configSetProp.immutable=false to avoid the error "This ConfigSet is immutable."; use the URL below to create the template before creating the collection
        # http://node02:8983/solr/admin/configs?action=CREATE&name=myConfigSet&baseConfigSet=schemalessTemplate&configSetProp.immutable=false&wt=xml
        # to search: content:"Psychologist"
        response = http.request('POST',url,body=json.dumps(combined_json),headers={'Content-Type': 'application/json'})
        print (response.data)
    except:
        print(traceback.format_exc())
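
The document-building step can be pulled out into a small function so that it can be checked without Tika or Solr running. A minimal sketch: build_solr_doc is a hypothetical helper, not part of tika or Solr, and falling back to an empty string when the parsed content is None is an assumption about what is wanted for files with no extractable text.

```python
import os


def build_solr_doc(path, parsed):
    """Build the dict that gets posted to Solr from a Tika-style parse result."""
    doc = {'id': os.path.basename(path)}      # use file name as Doc ID
    doc.update(parsed.get('metadata') or {})  # flatten metadata into the doc
    doc['content'] = parsed.get('content') or ''
    return doc


# Hand-written stand-in for the result of parser.from_file (assumption).
sample = {'metadata': {'Content-Type': 'application/pdf'}, 'content': 'hello'}
doc = build_solr_doc('/tmp/report.pdf', sample)
print(doc)
```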

Sunday, December 16, 2018

Sentry and Hive permission explained

When using Sentry, the impersonation feature of HiveServer2 is disabled and each query runs in the cluster as the configured Hive principal. Thus, each HDFS location associated with a Hive table should be readable and writable by the Hive user or group.
If you are using the HDFS ACL synchronization feature, the required HDFS permissions (r-x for SELECT, -wx for INSERT, and rwx for ALL) on files are enforced automatically and maintained dynamically in response to changes in privilege grants on databases and tables. In our example, the alice user would be given r-x permission to files in tables in the sales database. Note that a grant on a URI object does not result in corresponding permissions on the location in HDFS.
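
The privilege-to-permission mapping can be written down as a small lookup. This is only an illustration of the mapping quoted above, not Sentry code; hdfs_perms_for is a hypothetical name, and combining several privileges by taking the union of their bits is an assumption.

```python
# Mapping quoted in the text: Sentry privilege -> HDFS permission bits.
HDFS_PERMS = {'SELECT': 'r-x', 'INSERT': '-wx', 'ALL': 'rwx'}


def hdfs_perms_for(privileges):
    """Union of the rwx bits implied by a list of Sentry privileges."""
    bits = ['-', '-', '-']
    for privilege in privileges:
        for i, ch in enumerate(HDFS_PERMS[privilege.upper()]):
            if ch != '-':
                bits[i] = ch
    return ''.join(bits)


print(hdfs_perms_for(['SELECT']))            # read-only grant
print(hdfs_perms_for(['SELECT', 'INSERT']))  # both grants combined
```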

Sunday, September 30, 2018

Cloudera CDH "Host Clock Offset" explained

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cm_ht_host.html#concept_pnm_cmn_yk


This is a host health test that checks if the host's system clock appears to be out-of-sync with its NTP server(s). 
The test uses the 'ntpdc -np' (if ntpd is running) or 'chronyc sources' (if chronyd is running) command to check that the host is synchronized to an NTP peer and that the absolute value of the host's clock offset from that peer is not too large. 
If the command fails, NTP is not synchronized to a server, or the host's NTP daemon is not running or cannot be contacted, the test will return "Bad" health. The 'ntpdc -np' or 'chronyc sources' output contains a row for each of the host's NTP servers. The row starting with a '*' (if ntpdc) or '^*' (if chronyc) contains the peer to which the host is currently synchronized. No row starting with a '*' or '^*' indicates that the host is not currently synchronized. 
Communication errors and too large an offset between the peer and the host time are examples of conditions that can lead to a host being unsynchronized. Make sure that UDP port 123 is open in any firewall that is in use. Check the system log for ntpd or chronyd messages related to configuration errors. 
If running ntpd, use 'ntpdc -c iostat' to verify that packets are sent and received between the different peers. More information about the conditions of each peer can be found by running the command 'ntpq -c as'. The output of this command includes the association ID that can be used in combination with 'ntpq -c "rv <association ID>"' to get more information about the status of each such peer. The command 'ntpq -c pe' can also be used to return a summary of all peers and the reason why they are not in use. 
If running chronyd, use 'chronyc activity' to check how many NTP sources are online/offline. More information about the conditions of each peer can be found by running the command 'chronyc sourcestats'. To check chrony tracking, issue the command 'chronyc tracking'. 
If NTP is not in use on the host, this check should be disabled for the host using the configuration options shown below. Cloudera recommends using NTP for time synchronization of Hadoop clusters. A failure of this health test can indicate a problem with the host's NTP service or configuration. This test can be configured using the Host Clock Offset Thresholds host configuration setting.
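
The "row starting with '*' or '^*'" rule is easy to check by hand or in a script. A minimal sketch, not the actual Cloudera Manager implementation: synced_peer is a hypothetical helper, and the sample outputs are made-up illustrations of the two formats, using documentation-range IP addresses.

```python
def synced_peer(output):
    """Return the row of the peer the host is synchronized to, or None."""
    for line in output.splitlines():
        row = line.lstrip()
        # '*' marks the current peer in ntpdc output, '^*' in chronyc output.
        if row.startswith('*') or row.startswith('^*'):
            return row
    return None


# Made-up sample outputs for illustration only.
chrony_out = """MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- 192.0.2.11                    3   6   377    30   +900us[ +900us] +/-   45ms
^* 192.0.2.10                    2   6   377    34    +12us[  +15us] +/-   20ms"""

ntpdc_out = """     remote           local      st poll reach  delay   offset    disp
=======================================================================
*192.0.2.10      192.0.2.50       2   64  377 0.00050  0.001234 0.03000"""

print(synced_peer(chrony_out))
print(synced_peer(ntpdc_out))
```

A return value of None corresponds to the "not currently synchronized" case described above.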

Tuesday, August 21, 2018

Install banana dashboard in Cloudera (using CDH6 beta as example)


Step 1: download banana from github
[root@cdh60b ~]# wget https://github.com/lucidworks/banana/archive/release.zip

Step 2: unzip the release file into "/opt/cloudera/parcels/CDH/lib/solr/server/solr-webapp/webapp/"
[root@cdh60b ~]# unzip release.zip 

[root@cdh60b ~]# mv banana-release /opt/cloudera/parcels/CDH/lib/solr/server/solr-webapp/webapp/banana

[root@cdh60b webapp]# ls -l /opt/cloudera/parcels/CDH/lib/solr/server/solr-webapp/webapp/banana/
total 72
-rw-r--r-- 1 root root  665 Jun  4  2017 bower.json
-rw-r--r-- 1 root root 2669 Jun  4  2017 build.xml
-rw-r--r-- 1 root root 2464 Jun  4  2017 CONTRIBUTING.md
-rw-r--r-- 1 root root  262 Jun  4  2017 default.properties
-rw-r--r-- 1 root root 8478 Jun  4  2017 Gruntfile.js
-rw-r--r-- 1 root root 1531 Jun  4  2017 index.html
drwxr-xr-x 2 root root   31 Jun  4  2017 jetty-contexts
-rw-r--r-- 1 root root  610 Jun  4  2017 LICENSE.md
-rw-r--r-- 1 root root 2169 Jun  4  2017 mvn.template
-rw-r--r-- 1 root root 6990 Jun  4  2017 NOTICE.txt
-rw-r--r-- 1 root root 2176 Jun  4  2017 package.json
-rw-r--r-- 1 root root 3369 Jun  4  2017 pom.xml
-rw-r--r-- 1 root root  131 Jun  4  2017 QUICKSTART
-rw-r--r-- 1 root root 9969 Jun  4  2017 README.md
drwxr-xr-x 7 root root  134 Jun  4  2017 resources
drwxr-xr-x 8 root root  107 Jun  4  2017 src
drwxr-xr-x 4 root root  116 Jun  4  2017 test

Step 3: restart the Solr service using Cloudera Manager.

http://cdh60b.dbaglobe.com:8983/solr/banana/src/index.html#/dashboard


If you want to save and load dashboards from Solr, then you need to create a collection called banana-int first. For Solr 6, here are the steps:

[donghua@cdh60b ~]$ cd /opt/cloudera/parcels/CDH/lib/solr/bin
[donghua@cdh60b bin]$ ls
init.d                   oom_solr.sh   sentryMigrationTool  solr.cmd    solr.in.cmd  zksynctool.sh
install_solr_service.sh  post          snapshotscli.sh      solrctl.sh  solr.in.sh
log4j.properties         sentrycli.sh  solr                 solrd       zkcli.sh
[donghua@cdh60b bin]$ ./solr create -c banana-int
WARNING: Using _default configset. Data driven schema functionality is enabled by default, which is
         NOT RECOMMENDED for production use.

         To turn it off:
            curl http://localhost:8983/solr/banana-int/config -d '{"set-user-property": {"update.autoCreateFields":"false"}}'

Connecting to ZooKeeper at cdh60b.dbaglobe.com:2181/solr ...
INFO  - 2018-08-21 09:48:58.220; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at cdh60b.dbaglobe.com:2181/solr ready
Uploading /opt/cloudera/parcels/CDH/lib/solr/server/solr/configsets/_default/conf for config banana-int to ZooKeeper at cdh60b.dbaglobe.com:2181/solr

Creating new collection 'banana-int' using command:
http://localhost:8983/solr/admin/collections?action=CREATE&name=banana-int&numShards=1&replicationFactor=1&maxShardsPerNode=-1&collection.configName=banana-int

{
  "responseHeader":{
    "status":0,
    "QTime":3084},
  "success":{"cdh60b:8983_solr":{
      "responseHeader":{
        "status":0,
        "QTime":1651},
      "core":"banana-int_shard1_replica_n1"}}}


Sunday, August 19, 2018

Configure HDFS NFS Gateway


[root@cdh60 ~]# showmount -e cdh60
Export list for cdh60:
/ *

[root@cdh60 ~]# rpcinfo cdh60
   program version netid     address                service    owner
    100000    4    tcp6      ::.0.111               portmapper superuser
    100000    3    tcp6      ::.0.111               portmapper superuser
    100000    4    udp6      ::.0.111               portmapper superuser
    100000    3    udp6      ::.0.111               portmapper superuser
    100000    4    tcp       0.0.0.0.0.111          portmapper superuser
    100000    3    tcp       0.0.0.0.0.111          portmapper superuser
    100000    2    tcp       0.0.0.0.0.111          portmapper superuser
    100000    4    udp       0.0.0.0.0.111          portmapper superuser
    100000    3    udp       0.0.0.0.0.111          portmapper superuser
    100000    2    udp       0.0.0.0.0.111          portmapper superuser
    100000    4    local     /var/run/rpcbind.sock  portmapper superuser
    100000    3    local     /var/run/rpcbind.sock  portmapper superuser
    100005    1    udp       0.0.0.0.16.146         mountd     superuser
    100005    2    udp       0.0.0.0.16.146         mountd     superuser
    100005    3    udp       0.0.0.0.16.146         mountd     superuser
    100005    1    tcp       0.0.0.0.16.146         mountd     superuser
    100005    2    tcp       0.0.0.0.16.146         mountd     superuser
    100005    3    tcp       0.0.0.0.16.146         mountd     superuser
    100003    3    tcp       0.0.0.0.8.1            nfs        superuser
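
The address column in the rpcinfo output is in universal address format: the last two dot-separated numbers are the high and low bytes of the port. A small sketch to decode them (decode_uaddr is a hypothetical helper; it does not apply to the local socket rows):

```python
def decode_uaddr(uaddr):
    """Split an rpcinfo universal address into (host, port)."""
    host, high, low = uaddr.rsplit('.', 2)
    return host, int(high) * 256 + int(low)


for service, uaddr in [('portmapper', '0.0.0.0.0.111'),
                       ('mountd', '0.0.0.0.16.146'),
                       ('nfs', '0.0.0.0.8.1')]:
    print(service, decode_uaddr(uaddr))
```

So mountd listens on port 4242 and nfs on port 2049 in this output.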

[root@cdh60 ~]# mkdir /hdfs_nfs_mount


[root@cdh60 ~]# mount -t nfs -o vers=3,proto=tcp,nolock cdh60:/ /hdfs_nfs_mount
[root@cdh60 ~]# df -h /hdfs_nfs_mount
Filesystem               Size  Used Avail Use% Mounted on
cdh60:/                   69G  6.5G   63G  10% /hdfs_nfs_mount


Linux permissions apply:

[root@cdh60 dsuser]# ls -ld /hdfs_nfs_mount/data/incoming/
drwxr-xr-x 3 dsuser 2584148964 96 Aug 19 12:17 /hdfs_nfs_mount/data/incoming/

[root@cdh60 dsuser]# cp employees.csv /hdfs_nfs_mount/data/incoming/
cp: cannot create regular file ‘/hdfs_nfs_mount/data/incoming/employees.csv’: Permission denied

Log in as dsuser (owner of the incoming folder)

Donghuas-MacBook-Air:pandas donghua$ ssh dsuser@cdh60
dsuser@cdh60's password: 
Last login: Sun Aug 19 09:24:50 2018 from 192.168.1.1

[dsuser@cdh60 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: dsuser@DBAGLOBE.COM

Valid starting       Expires              Service principal
08/19/2018 09:24:54  08/20/2018 09:24:54  krbtgt/DBAGLOBE.COM@DBAGLOBE.COM
    renew until 08/26/2018 09:24:54

[dsuser@cdh60 ~]$ cp employees.csv /hdfs_nfs_mount/data/incoming/

[dsuser@cdh60 ~]$ ls -l /hdfs_nfs_mount/data/incoming/employees.csv
-rw-r--r-- 1 dsuser 2584148964 59175 Aug 19 12:17 /hdfs_nfs_mount/data/incoming/employees.csv

[dsuser@cdh60 ~]$ hdfs dfs -ls /data/incoming/
Found 1 items
-rw-r--r--   1 dsuser supergroup      59175 2018-08-19 12:17 /data/incoming/employees.csv


[dsuser@cdh60 ~]$ hdfs dfs -cat /data/incoming/employees.csv|head -n 3
First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
Douglas,Male,8/6/1993,12:42 PM,97308,6.945,true,Marketing
Thomas,Male,3/31/1996,6:53 AM,61933,4.17,true,

[dsuser@cdh60 ~]$ cat /hdfs_nfs_mount/data/incoming/employees.csv |head -n3
First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
Douglas,Male,8/6/1993,12:42 PM,97308,6.945,true,Marketing
Thomas,Male,3/31/1996,6:53 AM,61933,4.17,true,

Log in as root (acting as a normal NFS client user)

[root@cdh60 dsuser]# hdfs dfs -cat /data/incoming/employees.csv|head -n 3
18/08/19 12:26:56 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
cat: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "cdh60.dbaglobe.com/192.168.56.110"; destination host is: "cdh60.dbaglobe.com":8020; 

[root@cdh60 dsuser]# cat /hdfs_nfs_mount/data/incoming/employees.csv |head -n3
First Name,Gender,Start Date,Last Login Time,Salary,Bonus %,Senior Management,Team
Douglas,Male,8/6/1993,12:42 PM,97308,6.945,true,Marketing
Thomas,Male,3/31/1996,6:53 AM,61933,4.17,true,

Sunday, July 8, 2018

Using MySQL 8 with Hue in Cloudera CDH 5.15

Symptoms: Using MySQL 8 with Hue in Cloudera CDH 5.15

Error when adding Hue:

Unable to connect to database on host 'cdh-vm.dbaglobe.com' from host 'cdh-vm.dbaglobe.com' using the credential provided.


Error in cloudera-scm-server.log
+ exec /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/hue/build/env/bin/hue is_db_alive
[08/Jul/2018 19:30:15 +0000] settings     DEBUG    DESKTOP_DB_TEST_NAME SET: /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/hue/desktop/desktop-test.db
[08/Jul/2018 19:30:15 +0000] settings     DEBUG    DESKTOP_DB_TEST_USER SET: hue_test
[08/Jul/2018 04:30:23 +0000] __init__     INFO     Couldn't import snappy. Support for snappy compression disabled.
Error accessing DB: (2059, "Authentication plugin 'caching_sha2_password' cannot be loaded: /usr/lib64/mysql/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory")

How to fix:

alter user 'hue'@'%' IDENTIFIED WITH mysql_native_password BY 'my_complex_password';


Sunday, June 3, 2018

Two methods to modify HDFS custom metadata

Two methods to modify HDFS custom metadata with Cloudera Navigator

- Metadata file
    not recommended for production use, as it leads to small-file problems
    updates provided through metadata files are queued before being merged

- Metadata API
    use either metadata files or the API, not both
    the API overwrites metadata and takes effect immediately
    


[donghua@cdh-vm data]$ hdfs dfs -ls /data/donghua/*drink*
-rw-r--r--   1 donghua hive        145 2018-06-03 11:27 /data/donghua/.drinks.csv.navigator
-rw-r--r--   1 donghua hive       5918 2018-06-03 00:07 /data/donghua/drinks.csv

[donghua@cdh-vm data]$ hdfs dfs -cat /data/donghua/.drinks.csv.navigator
{
"name":"drinks dataset"
"description": "metadata example using .drinks.csv.navigator"
"properties":{
"Dept":"myDept"
},
"tags":["external"]
}
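
As displayed above, the file has no commas after the "name" and "description" members, so a strict JSON parser would reject it; whether Navigator tolerates that was not checked. A minimal sketch that produces the same content as valid JSON with the standard json module. Writing the text to a local file and uploading it next to the data file is left out here.

```python
import json

# Same members as the .drinks.csv.navigator file shown above.
metadata = {
    "name": "drinks dataset",
    "description": "metadata example using .drinks.csv.navigator",
    "properties": {"Dept": "myDept"},
    "tags": ["external"],
}

# json.dumps inserts the separators, so the result always parses.
text = json.dumps(metadata, indent=2)
print(text)
```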

curl -u admin:admin -X GET 'http://cdh01:7187/api/v13/entities/?query=originalName:imdb_1000.csv&limit=100&offset=0'
[donghua@cdh-vm data]$ curl -u admin:admin -X GET 'http://cdh-vm:7187/api/v13/entities/?query=originalName%3D%22imdb_1000.csv%22&limit=100&offset=0'
[ {
  "originalName" : "imdb_1000.csv",
  "originalDescription" : null,
  "sourceId" : "5",
  "firstClassParentId" : null,
  "parentPath" : "/data/donghua",
  "deleteTime" : 0,
  "extractorRunId" : "5##20",
  "customProperties" : null,
  "name" : null,
  "description" : null,
  "tags" : null,
  "properties" : {
    "__cloudera_internal__hueLink" : "http://cdh-vm:8889/filebrowser/#/data/donghua/imdb_1000.csv"
  },
  "technicalProperties" : null,
  "fileSystemPath" : "/data/donghua/imdb_1000.csv",
  "type" : "FILE",
  "size" : 91499,
  "created" : "2018-06-03T00:07:55.434Z",
  "lastModified" : "2018-06-03T00:07:55.434Z",
  "lastAccessed" : "2018-06-03T00:07:54.880Z",
  "permissions" : "rw-r--r--",
  "owner" : "donghua",
  "group" : "hive",
  "blockSize" : 134217728,
  "mimeType" : "application/octet-stream",
  "ezkeyName" : null,
  "replication" : 1,
  "metaClassName" : "fselement",
  "deleted" : false,
  "packageName" : "nav",
  "userEntity" : false,
  "sourceType" : "HDFS",
  "identity" : "20388",
  "internalType" : "fselement"
}, {
  "originalName" : "imdb_1000.csv",
  "originalDescription" : null,
  "sourceId" : "5",
  "firstClassParentId" : null,
  "parentPath" : "/user/hive/warehouse/testdb.db/imdb_1000",
  "deleteTime" : 0,
  "extractorRunId" : "5##22",
  "customProperties" : null,
  "name" : null,
  "description" : null,
  "tags" : null,
  "properties" : {
    "__cloudera_internal__hueLink" : "http://cdh-vm:8889/filebrowser/#/user/hive/warehouse/testdb.db/imdb_1000/imdb_1000.csv"
  },
  "technicalProperties" : null,
  "fileSystemPath" : "/user/hive/warehouse/testdb.db/imdb_1000/imdb_1000.csv",
  "type" : "FILE",
  "size" : 91499,
  "created" : "2018-06-03T01:06:12.920Z",
  "lastModified" : "2018-06-03T01:06:12.920Z",
  "lastAccessed" : "2018-06-03T01:06:12.920Z",
  "permissions" : "rw-r--r--",
  "owner" : "hive",
  "group" : "hive",
  "blockSize" : 134217728,
  "mimeType" : "application/octet-stream",
  "ezkeyName" : null,
  "replication" : 1,
  "metaClassName" : "fselement",
  "deleted" : false,
  "packageName" : "nav",
  "userEntity" : false,
  "sourceType" : "HDFS",
  "identity" : "22303",
  "internalType" : "fselement"
}, {
  "originalName" : "imdb_1000.csv._COPYING_",
  "originalDescription" : null,
  "sourceId" : "5",
  "firstClassParentId" : null,
  "parentPath" : "/data/donghua",
  "deleteTime" : 1527984475434,
  "extractorRunId" : "5##20",
  "customProperties" : null,
  "name" : null,
  "description" : null,
  "tags" : null,
  "properties" : null,
  "technicalProperties" : null,
  "fileSystemPath" : "/data/donghua/imdb_1000.csv._COPYING_",
  "type" : "FILE",
  "size" : 91499,
  "created" : "2018-06-03T00:07:54.880Z",
  "lastModified" : "2018-06-03T00:07:54.880Z",
  "lastAccessed" : "2018-06-03T00:07:54.880Z",
  "permissions" : "rw-r--r--",
  "owner" : "donghua",
  "group" : "hive",
  "blockSize" : 134217728,
  "mimeType" : "application/octet-stream",
  "ezkeyName" : null,
  "replication" : 1,
  "metaClassName" : "fselement",
  "deleted" : true,
  "packageName" : "nav",
  "userEntity" : false,
  "sourceType" : "HDFS",
  "identity" : "20386",
  "internalType" : "fselement"
} ]


curl -u admin:admin -X POST 'http://cdh-vm:7187/api/v13/entities/?query=originalName%3D%22imdb_1000.csv%22&limit=100&offset=0' \
-H "Content-Type:application/json" -d \
'{
"sourceId":"5",
"originalName" : "imdb_1000.csv",
"parentPath" : "/data/donghua",
"name":"imdb dataset",
"description": "metadata example using API",
"properties":{
"Dept":"myDept"
},
"tags":["external"]
}'


[donghua@cdh-vm data]$ curl -u admin:admin -X POST 'http://cdh-vm:7187/api/v13/entities/?query=originalName%3D%22imdb_1000.csv%22&limit=100&offset=0' \
> -H "Content-Type:application/json" -d \
> '{
> "sourceId":"5",
> "originalName" : "imdb_1000.csv",
> "parentPath" : "/data/donghua",
> "name":"imdb dataset",
> "description": "metadata example using API",
> "properties":{
> "Dept":"myDept"
> },
> "tags":["external"]
> }'
{
  "originalName" : "imdb_1000.csv",
  "originalDescription" : null,
  "sourceId" : "5",
  "firstClassParentId" : null,
  "parentPath" : "/data/donghua",
  "deleteTime" : 0,
  "extractorRunId" : "5##20",
  "customProperties" : null,
  "name" : "imdb dataset",
  "description" : "metadata example using API",
  "tags" : [ "external" ],
  "properties" : {
    "Dept" : "myDept",
    "__cloudera_internal__hueLink" : "http://cdh-vm:8889/filebrowser/#/data/donghua/imdb_1000.csv"
  },
  "technicalProperties" : null,
  "fileSystemPath" : "/data/donghua/imdb_1000.csv",
  "type" : "FILE",
  "size" : 91499,
  "created" : "2018-06-03T00:07:55.434Z",
  "lastModified" : "2018-06-03T00:07:55.434Z",
  "lastAccessed" : "2018-06-03T00:07:54.880Z",
  "permissions" : "rw-r--r--",
  "owner" : "donghua",
  "group" : "hive",
  "blockSize" : 134217728,
  "mimeType" : "application/octet-stream",
  "ezkeyName" : null,
  "replication" : 1,
  "metaClassName" : "fselement",
  "deleted" : false,
  "packageName" : "nav",
  "userEntity" : false,
  "sourceType" : "HDFS",
  "identity" : "20388",
  "internalType" : "fselement"
}