Arcpy: Summary Statistics for Mean and non-statistic value?

I'm trying to write a script that calculates the average, minimum, and maximum of three separate fields. Having looked for a while through other SO posts, I know it is possible to do this with a dictionary or a search cursor, but since I'm still new to Python I thought that, if possible, calling the Summary Statistics tool would be easier.
However, it seems that the tool only outputs the desired statistic and gives no control over including the other fields/rows that go along with it in the output. As an example, in a .dbf with an id field, a state_name field, and a death rate field, the script will find the highest death rate and output it with the id, but not with the state_name. Is it possible to code this somehow in arcpy?

Have you tried:
Statistics_analysis(in_table, out_table, statistics_fields;statistics_fields..., {case_field;case_field...})
Calculates summary statistics for field(s) in a table.
If you need to control which rows are input, you can make selections (e.g. with arcpy.SelectLayerByAttribute_management()) prior to calculating stats.
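For example, a rough sketch (the paths and field names are made up for illustration, and you should verify the exact output field name the tool generates):
import arcpy
# Hypothetical paths and field names, purely for illustration.
in_table = r"C:\data\mortality.dbf"
out_table = r"C:\data\mortality_stats.dbf"
# MEAN, MIN and MAX of one field; repeat the pair for each field you need.
arcpy.Statistics_analysis(
    in_table, out_table,
    [["death_rate", "MEAN"], ["death_rate", "MIN"], ["death_rate", "MAX"]])
# To keep the attributes that belong to the maximum (id, state_name), read the
# computed MAX back and fetch the matching row with a search cursor.
# (The output field is named MAX_<field>; a .dbf may truncate it to 10 characters.)
max_rate = next(arcpy.da.SearchCursor(out_table, ["MAX_death_rate"]))[0]
where = "death_rate = {0}".format(max_rate)
with arcpy.da.SearchCursor(in_table, ["id", "state_name", "death_rate"], where) as rows:
    for rid, state, rate in rows:
        print(rid, state, rate)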

Related

Random exhaustive (non-repeating) selection from a large pool of entries

Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way I could implement this using slow NOT IN queries with a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked-list-based index for each category from the IDs of documents belonging to this category. Shuffle it
Create a Bloom Filter containing all of the entries viewed by a particular user
Traverse the index using the iterator, using the Bloom filter to skip items the user has already viewed and return only not-yet-viewed items.
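A rough Python sketch of that idea (a plain set stands in for the Bloom filter here, and the category data is made up; a real Bloom filter would replace the set to bound per-user memory):
import random
# document ids per category as loaded from the database (made-up data)
category_index = {"science": list(range(1, 100001))}
for ids in category_index.values():
    random.shuffle(ids)            # shuffle once, then walk the index in order
viewed = set()                     # stand-in for the per-user Bloom filter
index_iter = iter(category_index["science"])   # this user's iterator over the index
def next_document(it, seen):
    for doc_id in it:
        if doc_id not in seen:     # the Bloom filter lookup in the real design
            seen.add(doc_id)
            return doc_id
    return None                    # the category is exhausted for this user
print(next_document(index_iter, viewed))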
If you track via a table which entries the user has seen... try this. I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id
where v.url_id is null
order by rand()
limit 1
This causes the database to go ahead and do a 1-for-1 join, and you're limiting your query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation but there's no guarantee that the url will be passed successfully to the user.
It depends on how users get their random entries.
Option 1:
A user pages through some entities and stops after a couple of them. For example, the user sees the current random entity, moves on to the next one, reads it, continues a couple of times, and that's it.
The next time this user (or another) gets an entity from this category, the record of already-viewed entities is cleared and you may return an already-viewed entity.
For that option I would recommend saving a (hash) set of already-viewed entity ids; every time the user asks for a random entity, randomly choose one from the DB and check that it is not already in the set.
Because the set is so small and your data is so big, the chance of getting an already-viewed id is so small that this takes O(1) most of the time.
Option 2:
A user pages through the entities, and the viewed entities are saved across all users and across every visit to your page.
In that case you will probably use up all the entities in each category, and saving all the viewed entities plus checking whether an entity has been viewed will take some time.
For that option I would get all the ids for this topic, shuffle them, and store them in a linked list. When you want a random not-yet-viewed entity, just take the head of the list and delete it (O(1)).
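A rough sketch of both options in Python (the entity ids are made up, and in Option 1 the "DB" is reduced to a plain list for brevity):
import random
from collections import deque
all_ids = list(range(1, 100001))   # entity ids for one category (made-up data)
# Option 1: keep a small per-user set of viewed ids and pick optimistically.
viewed = set()
def random_unviewed():
    while True:                    # a repeat hit is rare while the set stays small
        candidate = random.choice(all_ids)
        if candidate not in viewed:
            viewed.add(candidate)
            return candidate
# Option 2: shuffle the ids once, share the queue, and pop the head in O(1).
shuffled = deque(random.sample(all_ids, len(all_ids)))
def next_shared():
    return shuffled.popleft() if shuffled else None
print(random_unviewed(), next_shared())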
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
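A small sqlite3 sketch of that optimistic check (sqlite stands in for whatever relational database is actually in use; table and column names are made up):
import random
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, category INTEGER)")
conn.execute("CREATE TABLE viewed (user_id TEXT, category INTEGER, document_id INTEGER)")
conn.execute("CREATE INDEX idx_viewed ON viewed (user_id, category, document_id)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(i, i % 6) for i in range(1, 10001)])
def pick_unread(user_id, category):
    # optimistic loop: almost always succeeds on the first try
    # (a real implementation would cap the retries)
    while True:
        doc_id = conn.execute(
            "SELECT id FROM documents WHERE category = ? ORDER BY RANDOM() LIMIT 1",
            (category,)).fetchone()[0]
        already = conn.execute(
            "SELECT 1 FROM viewed WHERE user_id = ? AND category = ? AND document_id = ?",
            (user_id, category, doc_id)).fetchone()
        if already is None:
            conn.execute("INSERT INTO viewed VALUES (?, ?, ?)",
                         (user_id, category, doc_id))
            return doc_id
print(pick_unread("jj", 3))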
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [generate stable order]). Depending on the SQL dialect in use, there are different clauses that can be used to retrieve only the part of the result set you want (the MySQL LIMIT clause, the SQL Server TOP clause, etc.)
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. Using the scheme described above you don't have to store any state information at all.
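Sketched with sqlite3 (standing in for whatever SQL dialect is in use; table and column names are made up), the three steps might look like this:
import random
import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, category INTEGER, body TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?, ?)",
                 [(i, i % 6, "text %d" % i) for i in range(1, 10001)])
category = 3
# 1.) number of candidate documents
count = conn.execute("SELECT COUNT(*) FROM documents WHERE category = ?",
                     (category,)).fetchone()[0]
# 2.) random position within that range
offset = random.randrange(count)
# 3.) fetch exactly one row, with a stable order so the offset is meaningful
row = conn.execute(
    "SELECT * FROM documents WHERE category = ? ORDER BY id LIMIT 1 OFFSET ?",
    (category, offset)).fetchone()
print(row)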
You may want to consider a nosql solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
create a CF(column family ie table) for each category (creating these on-the-fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column named after that user to the document's row and set it to true. Obviously this table will be huge, with millions of columns and probably quite sparsely populated, but that's no problem; reading it is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any row where that user's column is null.
You should get constant-time writes and reads, amazing scalability, etc. if you can accept Cassandra's "eventually consistent" model (i.e., it is not mission critical that a user never gets a duplicate document).
I've solved a similar problem in the past by indexing the relational database into a document-oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store a user's past X selections in a cookie or something.
Return the last selections to the server with the user's new criteria.
Randomly choose one of the texts satisfying the criteria until it is not a member of the last X selections of the user.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X but I have in mind something like an X of say 16?
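A hedged sketch of that idea in Python (the candidate query is faked with a list; in practice the ids would come from the database and the recent list would live in a cookie or session):
import random
RECENT_LIMIT = 16                  # the "X" suggested above
def pick_text(candidate_ids, recent):
    # candidate_ids: texts matching the user's criteria; recent: their last X picks
    choice = random.choice(candidate_ids)
    while choice in recent and len(set(candidate_ids)) > len(recent):
        choice = random.choice(candidate_ids)   # repeats allowed only if unavoidable
    recent.append(choice)
    del recent[:-RECENT_LIMIT]     # keep only the last X selections
    return choice
recent_picks = []                  # would round-trip via a cookie or session
candidates = list(range(1, 1001))  # ids satisfying the user's criteria (made up)
print(pick_text(candidates, recent_picks))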

SAP ABAP Infoset Query - SELECT SUM and Duplicate lines

I am having trouble figuring out where to start/how to get the correct output.
I am very new to ABAP coding. I am trying to create an Infoset query and need to do a bit of coding in SQ02.
I have two tables joined - one being RBKP as the header for Invoice Receipts the other is RBDRSEG for the Invoice Document Items.
The query needs to run following some irrelevant parameters/variants, but when it does, it needs to:
Look in RBDRSEG for all matching document numbers (RBKP-BELNR EQ RBDRSEG-RBLNR).
In doing so RBDRSEG may or may not have multiple line results for each Doc No.
I need to total the field RBDRSEG-DMBTR for each Doc. No. result.
(If there are 5 lines for a Doc. No., DMBTR will have a different value for each, and these need to be totaled.)
At this point I need the output to only show (along with other fields in RBKP) One line with the SUM of the DMBTR field for each Doc. No.
I then need to have another field showing the difference between the field RBKP-RMWWR (which is the invoice total) and the DMBTR total that was calculated earlier for that Doc. No.
If you could help, I would be incredibly grateful.
First you need to define a structure that will contain your selection data. For this requirement it would contain the fields BELNR, GJAHR, RMWWR and WAERS from RBKP, a DMBTR field for the summed item amounts, and a DIFF field for the difference.
Don't forget to activate the structure and make sure it doesn't contain errors.
Now create the selection report. To use a report as the data selection method, you need to add two comments, *<QUERY_HEAD> and *<QUERY_BODY>. *<QUERY_HEAD> has to be placed where your START-OF-SELECTION usually would go, *<QUERY_BODY> inside a loop that puts the selected lines into a work area with the same name as the structure you defined in SE11.
I made an example report to show how this would work:
REPORT zstack_rbkp_infoset_query.
TABLES:
  rbkp,
  zstack_rbkp_infoset_str.
SELECT-OPTIONS:
  so_belnr FOR rbkp-belnr,
  so_gjahr FOR rbkp-gjahr.
DATA:
  itab    TYPE STANDARD TABLE OF zstack_rbkp_infoset_str,
  wa_itab TYPE zstack_rbkp_infoset_str.
DATA:
  lv_diff TYPE dmbtr.
* here your selection starts
*<QUERY_HEAD>
SELECT rbkp~belnr
       rbkp~gjahr
       rbkp~rmwwr
       rbkp~waers
       SUM( rbdrseg~dmbtr ) AS dmbtr
  FROM rbkp LEFT OUTER JOIN rbdrseg
    ON rbdrseg~rblnr EQ rbkp~belnr AND
       rbdrseg~rjahr EQ rbkp~gjahr
  INTO CORRESPONDING FIELDS OF TABLE itab
  WHERE rbkp~belnr IN so_belnr AND
        rbkp~gjahr IN so_gjahr
  GROUP BY rbkp~belnr rbkp~gjahr rbkp~rmwwr rbkp~waers.
LOOP AT itab INTO wa_itab.
  lv_diff = wa_itab-dmbtr - wa_itab-rmwwr.
  MOVE lv_diff TO wa_itab-diff.
  MODIFY itab FROM wa_itab.
ENDLOOP.
* this is the part that forwards your result set to the infoset
LOOP AT itab INTO zstack_rbkp_infoset_str.
*<QUERY_BODY>
ENDLOOP.
The sample report first selects the RBKP lines along with a sum of RBDRSEG-DMBTR for each document in RBKP. After that, a loop updates the DIFF column with the difference between the selected columns RMWWR and DMBTR.
Unfortunately in our SAP system the table RBDRSEG is empty, so I can't test that part of the report. But you can test the report in your system by adding a break point before the first loop and then starting the report. You should then be able to have a look at the selected lines in internal table ITAB and see if the selection works as expected.
Caveats in the example report: RBKP and RBDRSEG reference different currency fields, so it is possible your values in RMWWR and DMBTR are in different currencies (RMWWR is in document currency, DMBTR seems to be in the default company currency). If that can be the case, you will have to convert them into the appropriate currency before calculating the difference. Please make sure to join RBKP and RBDRSEG using both the document number in BELNR/RBLNR and the year in GJAHR/RJAHR (the field in RBDRSEG is RJAHR, not GJAHR, although GJAHR also exists in RBDRSEG).
When your report works as expected, create the infoset based on your report. You can then use the infoset like any other infoset.
Update:
I just realized that because you wrote about being new to ABAP, I immediately assumed you need to create a report for your infoset. Depending on your actual requirements this may not be the case. You could create a simple infoset query over table RBKP and then use the infoset editor to add two more fields for the line total and the difference, then add some ABAP code that selects the sum of all corresponding lines in RBDRSEG and calculates the difference between RMWWR and that aggregated sum. This would probably be slower than a customized ABAP report, as the select would have to be repeated for each line in RBKP, so it really depends on the amount of data your users are going to query. A customized ABAP report is fine, flexible and quick, but it may be overkill, and the number of people able to change a report is smaller than the number of people able to modify an infoset.
Additional Info on the variant using the infoset designer
First create a simple infoset reading only table RBKP (so no table join in the infoset definition). Then go to the application-specific enhancements.
In my example I already added 2 fields, LINETOTAL and DIFFERENCE. Both have the same properties as RBDRSEG-DMBTR. Make sure your field containing the sum of RBDRSEG-DMBTR has a lower sequence (here '1') than the field containing the difference. The sequence determines which fields will be calculated first.
Click on the coding button for the first field and add the coding that selects the sum for a single RBKP entry.
Then do the same for the difference field.
Now that you have both fields available in your field list, you can add them to your field group on the right.
As mentioned before, the code you just entered will be processed for each line in RBKP. So this might have a huge impact on runtime performance, depending on the size of your initial result set.

Struggling with a solr query and relevance

I have a problem with boosting when using Solr. We recently switched from Lucene to Solr.
We have 4 (primary) search fields that we search against: essence, keywords, allSearchable, and quality; where, for each document in the index, essence contains the first 3 non-stop words in keywords. 'keywords' is just a list of keywords. And 'allSearchable' holds data that is just a collection of other data for a given document. What we did in lucene was to do 3 searches for any given search that a user typed into the search box (in order to rank the search results by relevance), like so:
word typed into searchbox: tree
Query 1: +essence:tree (sort by 'quality')
if Query 1 returns enough for the page we're wanting to get, then return.
Query 2: +keywords:tree (sort by 'quality')
if the combination of Query 1 and Query 2 returned enough results for the page we're on, then return the results.
Query 3: +allSearchable:tree (sort by 'quality')
Return the results. If there aren't any, then tough luck.
My problem is with pagination. I did not use to have to send pagination (startIndex, rows) to Lucene. I could just ask for everything, and then roll over everything that I get back, collecting enough results to return, depending on the page I was asking for. With Solr, I must pass pagination parameters. We have over 8 million documents in our index, so to get everything that matches a query like 'tree' is way too expensive. The problem is that if I ask for page 3 in Query 1, and I don't get enough results, then I must go on to Query 2 (keywords:tree). But this isn't right, because I am asking for page 3's results for Query 2 (in other words, give me all documents that match 'keywords:tree' for page 3). But that's not really the question I want to ask. I only want to ask for page 1 of keywords if essence doesn't match anything. And so on.
What I am really looking for is ONE query, that would suffice for these three queries that I did before, such that I get back the essence matches first, the keyword matches second, and the allSearchable matches last.
I tried using boosting with this query: essence:tree^4.0 keywords:tree^2.0 allSearchable:tree^1.0
But this doesn't seem to do the trick, and I don't know why. I took out the sorts, and things still don't give me back the correct results. I am using the default StandardRequestHandler (which seems to use the LuceneQueryParser, not dismax or edismax). I can see that boosts are being sent to Solr in the URL (I use boosting by adding a qf parameter to the defaults section of my requestHandler in solrconfig.xml). I certainly know that Lucene can understand these parameters. Can anyone tell me how I might be able to construct one query that would allow me to get results like I want as outlined above?
I would recommend using the ExtendedDismax Query Parser (eDisMax) and you can then specify the boosting across the fields as shown in the example below:
http://localhost:8983/solr/select/?q=tree
&defType=edismax&qf=essence^4.0+keywords^2.0+allSearchable^1.0
You might need to adjust the boosting values up or down across the fields to get the desired results. Plus there are additional parameters for eDisMax that affect the boosting and how the query is executed that you should examine.
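For illustration, the same request issued from Python with the requests library (the core name, host and field list here are assumptions; adjust them to your setup):
import requests
params = {
    "q": "tree",
    "defType": "edismax",
    "qf": "essence^4.0 keywords^2.0 allSearchable^1.0",
    "fl": "id,essence,keywords,quality",
    "rows": 20,     # page size
    "start": 0,     # offset: one query, so ordinary pagination works
    "wt": "json",
}
response = requests.get("http://localhost:8983/solr/collection1/select", params=params)
for doc in response.json()["response"]["docs"]:
    print(doc.get("id"))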

Table Column Value Calculations thru Groovy Scripting

At the outset, let me confess that I am a newbie to Groovy Script but have prior scripting experience in other languages.
I am trying to calculate a sum of field values entered by users in a web form based on an Oracle Table. This Table has approx 5 sections with about 5 questions in each section. Each question has a drop down list like Yes, No & N/A. All the questions (read Table Columns) in each section are named like SEC_1_Q1, SEC_1_Q2,.... SEC_2_Q1,SEC_2_Q2,.... etc.
Depending upon the value (Yes, No, N/A) a user picks, a score (1, 2, 0) is assigned to the corresponding question. Each section has its own total (which becomes a sub-total for the entire table, so there will be 5 sub-totals), and there is a Grand Total, which is a sum of all the section totals. All the sub-totals and the Grand Total are visible to the users and are updated OnItemChange for any question.
This is how I visualize creating a script for this
First go through the column names matching an expression, create an array or some kind of list, get the user-provided values, sum them up, and update the sub-totals and the grand total so a user can see them.
This will avoid hard-coding the column names and will also leave room in the future for including/excluding questions from the current sections.
I have looked hard and searched on the web, but haven't come across any ideas for Groovy Scripting.
I am sure there are many other great options to achieve this thru Java and other languages, but my choice right now is limited to Groovy.
I was wondering if any/all good souls out there can point me to the right direction(s). Any cooked, half-cooked Groovy script examples will be greatly appreciated.

How do I build a sum report to feed a Salesforce gas gauge dashboard?

I'm trying to build what I think would be a simple report to feed a salesforce gas gauge dashboard component.
The end goal is to have a goal amount (easy enough to set in the formatting parameters of the gas gauge component) that moves the needle based on the sum of an "amounts" field in closed opportunities.
No matter how I group the reports, the only options for a value in the "Component Data" tab of the gas gauge itself are: Auto, Average Age, and Record Count... all pretty useless for what I'm trying to do. What I really want to use is the total amount... a sum.
Any help appreciated!
The groups available on the dashboards are based on the summarised fields within the report.
It sounds like your report summarises the age (and the standard Record Count), but there is no summary involving the Amount field.
Go to your report and summarise the Amount field. You can do this by clicking the down arrow next to the 'Amount' column and selecting 'Summarize this field'. Once it's summarised, it should appear as an option on your gauge.
